(57) Abstract: The invention relates to improved methods for directed evolution of polymers, including directed evolution of nucleic
acids and proteins. Specifically, the methods of the invention include analytical methods for identifying "crossover locations" in a
polymer. Crossovers at these locations are less likely to disrupt desirable properties of the protein, such as stability or functionality.
The invention further provides improved methods for directed evolution wherein the polymer is selectively recombined at the
identified "crossover locations". Crossover disruption profiles can be used to identify preferred crossover locations. Structural domains
of a biopolymer can also be identified and analyzed, and domains can be organized into schema. Schema disruption profiles can
be calculated, for example based on conformational energy or interatomic distances, and these can be used to identify preferred or
candidate crossover locations. Computer systems for implementing analytical methods of the invention are also provided.
GENE RECOMBINATION AND HYBRID PROTEIN DEVELOPMENT



This application claims the benefit of U.S. Patent Application Serial No. 
10/016,668, filed October 26, 2001, which is hereby incorporated by reference in its 
entirety. 

Numerous references, including patents, patent applications and various
publications, are cited and discussed in this specification. The citation and/or discussion
of such references is provided to clarify the description of the invention and is not an
admission that any such reference is "prior art" to the invention described herein. All
references cited and discussed in this specification are incorporated by reference in their
entirety and to the same extent as if each reference was individually incorporated by
reference.

Work leading to this invention was supported by Grant No. N00014-96-1-0340
awarded by the United States Navy. The United States government may have certain
rights to this invention, pursuant to the terms of that grant.

1. FIELD OF THE INVENTION

The invention relates to biomolecular engineering and design, including methods 
for the design and engineering of biopolymers such as proteins and nucleic acids. 

More particularly, the invention relates to improved methods for in vivo and in
vitro directed evolution of biopolymers, such as polypeptides (e.g. proteins) and
oligonucleotides (e.g. DNA and RNA). The invention is particularly suited to techniques
which generate hybrid biopolymers by recombining sequences of biopolymer building
blocks, such as sequences of amino acid residues or nucleic acid residues, from more than
one parent biopolymer (e.g. from two or more parent genes). This can be referred to as
"crossing" two or more parents to produce recombinant offspring. Each location in the
offspring where the biopolymer sequence changes or "crosses over" from one parent to
another is called a "crossover location" or a "cut point." A related term, known in the
genetic algorithm literature, is "schema." In the context of protein engineering, a schema
is a representation or arrangement of polymer building blocks, such as nucleic or amino
acid residues, or recognizable structural domains or energetic conformations, in which
each building block contributes more or less to the structural integrity, form, function, or
fitness of the polymer. In a recombination experiment, parents may have similar or
different schema, and the offspring may preserve or disrupt the schema of one or more
parents. In a preferred embodiment of the invention, schema that are common to two or
more parents are preserved in recombinant offspring.

The invention provides computational methods for predicting beneficial
recombinations of biopolymers, e.g. the fragments, locations or schema of two or more
parent genes which can advantageously be recombined. Directed evolution methods can
be selected and applied to favor identified recombinations. When cut points are applied at
locations that preserve schema, the recombinant mutant library has a larger fraction of
folded, stable hybrids or chimeras. Because the stability of the wild type is preserved, it
is more likely that mutants exist in this library that have improvements in the desired
properties.

For example, recombinant protocols can be modeled in silico to predict crossover
locations which will tend to preserve, and not disrupt, advantageous schema. The
computational or in silico techniques of the invention can be used to determine preferred
crossover locations. Residues of one or more biopolymers are identified (e.g. nucleotide
residues of a nucleic acid or amino acid residues of a polypeptide) where crossover
recombination may produce beneficial results, such as one or more improved properties.
Preferably, improvements are obtained while minimally disrupting a desired biopolymer
property, such as stability or functionality. Disruption is less likely when biopolymers
are cut and recombined at structurally tolerant crossover sites determined according to
the invention. Crossover locations on parent biopolymers are identified which tend to
have little or no impact on the stability of the three-dimensional structure of the
biopolymer, represented e.g. as schema, according to specified thresholds or parameters.
These locations can be used as candidate crossover locations for recombination
experiments. Alternatively, sets of interacting residues or schema can be identified which
are collectively crucial or important to the structure of the biopolymer, according to
specified thresholds or parameters. Crossovers that disrupt these sets of beneficially
interacting residues or schema are not desirable because they lead to destabilized
structures, and thus can be ruled out.

These techniques provide a targeted approach for obtaining mutant or hybrid
biopolymers with improved properties using directed evolution. For example, the
invention is useful in the design of in vitro recombination experiments where nucleic acid
sequences that encode two or more different parent proteins may be recombined to create
hybrid sequences. Unlike other directed evolution methods, such as family shuffling,
that require high sequence identity or similarity (e.g. 70% or higher), the invention can
be applied to parent proteins of low sequence similarity, e.g. less than 50%, or of no
sequence similarity (0%). For example, cut points for the recombination of proteins are
selected based on preserving three-dimensional or conformational structure or structural
motifs. Common structures or domains can be identified independently of amino acid
sequence, or without requiring overall sequence similarity. Widely different sequences
may code for the same or similar structures or schema. Different proteins with different
functions may have similar structures. Such proteins can be identified and selected as
parents for crossover recombination, at selected cut points which preserve or minimize
disruption of common structures. This improves the likelihood of producing mutants
with functions or properties from more than one parent. For example, a protease of high
activity may be recombined, at selected cut points, with a second, structurally similar
protein of high thermal stability, to produce a thermostable protease with high activity.
By focusing on structural similarity and by minimizing structural disruption, the
invention provides mutants having new or improved properties, without needing to rely
on serendipitous results from random recombinations of parents having a high sequence
similarity. Recombination based on hybridization or sequence identity can be called
"homologous" recombination. Recombination that is not based on sequence identity can
be called "non-homologous" recombination. The invention encompasses both methods,
which can be used independently or together.
2. BACKGROUND OF THE INVENTION

The invention is concerned with polymers, primarily biopolymers such as
polynucleotides (chains of nucleic acids, e.g. DNA and RNA) and polypeptides (chains
of amino acids, e.g. proteins and enzymes). More particularly, the invention provides
improved hybrid proteins and methods of obtaining them by crossover recombination.

Proteins are polypeptides that are useful to living organisms. For example, they
provide structures in the body, do physical or chemical work, or act as catalysts for
chemical reactions (i.e. as enzymes). Proteins are made by cells according to genetic
information encoded, transcribed and translated by polynucleotides (DNA and RNA).

It is often desirable to modify proteins so that they have new or improved properties. For
example, a protein may be altered to increase its biological activity (e.g. its potency as
an enzyme), or to improve its stability under different environmental conditions (e.g.
temperature), or to change its function (e.g. to catalyze a different chemical reaction).

Nature makes these kinds of alterations in many ways, including for example
genetic mutations, or changes due to the recombination of genetic material such as occurs
from sexual reproduction. Changes that are beneficial tend to be preserved from
generation to generation, while truly harmful changes may disappear over time, in a
process called evolution. Changes which are neutral, i.e. neither helpful nor harmful,
may also be preserved by default. This is a very long process, and tends to produce
random changes which are then tested for survival by the environment. Scientists
looking for proteins with improved properties have had the very difficult task of
searching for changes in proteins at random, from the vast numbers of potential natural
sources that are available. Changes that are desirable may not be produced or preserved
by nature. Breeding experiments can be done to provide additional sources for genetic
variation, tending toward traits of interest, but these techniques also are exceedingly slow,
costly, and resource intensive. They are very inefficient, and may not produce desired
results. For example, proteins that act as enzymes to break down other proteins can be
used as stain-removing ingredients of a laundry detergent, but these proteins may have
to work at higher temperatures than in nature.

Identifying proteins with desirable characteristics from nature, such as enzymes
with improved heat resistance (thermal stability) or other fitness characteristics, has been
a haphazard and difficult process. Accordingly, there has been a need for new ways to
modify proteins, or the polynucleotides which encode them, to produce new proteins with
improved properties or fitness. Two separate techniques commonly used to alter the
properties of proteins and other biological molecules are directed evolution and
computational design. The invention brings these techniques together, and in particular
provides guided processes of genetic diversity that reduce the sequence space to be
searched, are less prone to random results, and are more prone to produce proteins with
improved fitness. According to the invention, preferred or optimal cut points for
recombination, fragment sizes, and recombination strategies are provided. Structural
information about parent proteins, such as knowledge of epitopes or active sites, or
results of prior mutagenesis experiments, can be used to improve the outcome of protein
evolution experiments. Other factors, such as library size and landscape data (e.g.
structure/function relationships) can also be taken into account. Principles of statistical
mechanics are applied to genetic algorithms, to produce computational models of
evolutionary processes. These models correlate with observations and experiments in
directed evolution, and can be adapted to different experimental designs. The
computational models can also be used to provide a protein design model, which
generates candidate recombinants in silico more rapidly than conventional in vitro
methods, thus allowing experimental parameters to be rapidly tested and optimized.

Directed Evolution

Directed evolution techniques attempt to alter the properties of a biopolymer (e.g.,
a protein or a nucleic acid) by accumulating stepwise improvements through iterations
of random mutagenesis, recombination and screening. See, e.g., Moore & Arnold, Nature
Biotechnology 1996, 14:458; Miyazaki et al., J. Mol. Biol. 2000, 297:1015-1026; Arnold,
Adv. Protein Chem. 2000, 55:ix-xi. Broadly speaking, these methods work by speeding
up the natural processes of evolution. Changes in genetic material (e.g. mutations) are
rapidly and artificially induced, typically in cells that can be easily and quickly grown in
cell culture (e.g. outside the body). The resulting mutants are rapidly evaluated to
identify new or improved properties or changes of interest.

In a typical in vitro protein evolution experiment, a naturally occurring or
wild-type protein is identified, and its sequence is altered to produce diversity, for example by
mutation or recombination. This results in large numbers of mutant proteins, which are
screened according to appropriate fitness criteria; for example, the most active mutants
that are reasonably stable may be selected. One or more of these mutants may then be
selected as a parent for another round of evolution. This process may be repeated as
desired, for example until no further improvements in fitness are observed.

Genetic recombination methods have been widely applied to accelerate in vitro
protein evolution. Examples include DNA shuffling, random-priming recombination,
and the staggered extension process (StEP). See e.g., Stemmer, Proc. Natl Acad. Sci.,
91:10747 (1994); Stemmer, Nature, 370:389 (1994); Zhao & Arnold, Nucleic Acids Res.,
25:1307 (1997); Zhao et al., Nature Biotechnology, 49:290 (1998); Crameri et al.,
Nature, 391:288 (1998); Volkov et al., Methods Enzymol., 382:447-456 (2000).

Some of the advantages of directed evolution methods are that they can be used
with large polymers, for example proteins with more than 500 amino acids; they produce
unique and unexpected results; and polymers can be evolved to achieve several goals
simultaneously. One disadvantage is that directed evolution is limited by the genetic
code. For example, there are sixty-four 3-base nucleic acid codons that code for 20
amino acids. A single mutation in a codon may not be enough for a wild-type amino acid
to be changed into all 19 other possible amino acids. Often, two or more DNA mutations
in the codon are required. In directed evolution experiments, the DNA mutation rate is
small and the gene is large, so the probability of obtaining two neighboring DNA
mutations is small. Practically, this means that not all amino acid mutations are possible
using random mutagenesis alone. Nevertheless, the number of hybrids which can be
produced is vast, and they cannot be made and screened as readily as would
be desired. It is also difficult to produce simultaneous non-additive arrangements of
sequences. A non-additive effect means that two or more simultaneous mutations have
to be made in order to observe a fitness improvement. Often, the individual mutations
lead to a decreased fitness. Because the mutation rate is small and the gene is large, there
is a very small probability of obtaining the precise multiple-mutant needed to observe a
non-additive change, and one that provides a benefit or fitness improvement.
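
The codon limitation noted above can be checked by enumeration. The following is a minimal illustrative sketch, not part of the original disclosure. It assumes the standard genetic code, and the function names are hypothetical. For each codon it lists the amino acids encoded by codons that differ at exactly one base.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reachable_by_single_mutation(codon):
    """Return the amino acids, other than the one already encoded, that are
    encoded by codons differing from `codon` at exactly one base.
    Stop codons are excluded from the result."""
    original = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            neighbor = codon[:pos] + base + codon[pos + 1:]
            aa = CODON_TABLE[neighbor]
            if aa != "*" and aa != original:
                reachable.add(aa)
    return reachable


def max_reachable():
    """Largest number of distinct amino acids reachable from any one codon."""
    return max(len(reachable_by_single_mutation(c)) for c in CODON_TABLE)
```

Each codon has only nine single-base neighbors, so no codon can reach all 19 alternative amino acids in one step. For the tryptophan codon TGG, for instance, the sketch reports five reachable amino acids (C, G, L, R and S).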
Computational Design

Computational design, by contrast, has developed separately from directed
evolution and is a fundamentally different approach. See, Street & Mayo, Structure
7:R105 (1999). Unlike the essentially random approach of directed evolution,
computational design attempts to predict and then make the changes or mutations that
will be beneficial or useful. Thus, the general objective of computational design is to
identify particular interactions in a protein (or other biopolymer) that lead to desirable
properties, and then modify the biopolymer sequence to optimize those interactions. For
example, a force-field model can be used to quantitatively describe interactions between
amino acid residues in a protein. An amino acid sequence may then be computed, at least
in theory, to globally optimize these interactions. See e.g., Malakauskas & Mayo, Nature
Structural Biology, 5:470 (1998); Dahiyat & Mayo, Science, 278:82 (1997).

Some of the advantages of computational protein design are that very large
numbers of sequences can be screened in silico, e.g. 10^20 to 10^300; multiple mutations can be
considered simultaneously; and all possible amino acid substitutions (the entire possible
sequence space) can be searched. Some disadvantages are that computational
requirements increase exponentially with larger polymer sequences; at least some
structural information (e.g. a defined secondary sequence) is needed; and certain unique
or unexpected possibilities may be overlooked because the polymer backbone is held
constant for the calculations. In addition, it takes considerable if not restrictive
computing power and computation time to calculate detailed energies between all
possible amino acid combinations.

The Sequence Space

Computational design can effectively search a large sequence space, that is, a
large number of sequences (e.g., > 10^26). See, Dahiyat & Mayo, Science 278:82 (1997).
However, the technique is currently limited by the size of the biopolymer. The largest
full sequence design accomplished to date is a 28-mer zinc finger protein (id.). Partial
designs can be done to improve the stability of proteins up to about 70 amino acids.
Moreover, the technique currently is based on calculating the molecule's conformational
energy, i.e. the relative energy of the molecule's folded and unfolded states. Thus, current
computational methods have only been used to improve a molecule's stability. The
technique has not been used to improve other properties of biopolymers, such as activity,
selectivity, efficiency, or other characteristics of biological fitness.

Directed evolution methods, by contrast, have the benefit of improving any
property in a molecule that can be detected and/or captured by a screen, for example
catalytic activity of an enzyme. One effective and widely used directed evolution method
involves production of a library of mutants from a parent sequence, e.g., by using
error-prone PCR to produce random point mutations. Moore & Arnold, Nature
Biotechnology, 14:458 (1996); Miyazaki et al., J. Mol. Biol., 297:1015-1026 (2000).
However, the technique is limited by several factors, one of which is the practical size of
the screen. Zhao & Arnold, Curr. Op. St. Biol., 7, 480-485 (1997). Increasing the
number of mutants screened enables the user to sample a larger fraction of possible
sequences (a larger sequence space) and therefore provides better improvements in the
properties of interest. However, the largest number of mutants that may be observed in any practical
screen or selection is between about 10^3 and 10^12, depending upon the specific screening
method. In comparison, an average protein of 300 residues will have at least
10^390 possible amino acid combinations. Thus, any practical screening or selection assay
can only search a small fraction of the possible sequences.
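
The scale of this mismatch follows from simple arithmetic: with 20 possible amino acids at each of 300 positions, there are 20^300 sequences, and 300 x log10(20) is about 390.3. The sketch below is illustrative only, with hypothetical function names. It works in base-10 logarithms to keep the numbers readable.

```python
import math


def log10_sequence_space(length, alphabet_size=20):
    """Base-10 logarithm of the number of sequences of the given length."""
    return length * math.log10(alphabet_size)


def log10_fraction_screened(log10_library_size, length, alphabet_size=20):
    """Base-10 logarithm of the fraction of sequence space covered by a
    screen of 10**log10_library_size distinct mutants."""
    return log10_library_size - log10_sequence_space(length, alphabet_size)
```

Under these assumptions, even a screen of 10^12 mutants covers a fraction of roughly 10^-378 of the sequence space of a 300-residue protein.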

Moreover, the probability that any single random mutation will improve a
property of the parent sequence is small, and the probability of improvement decreases
rapidly when multiple simultaneous mutations are made. Furthermore, the negligible
probability that two or three mutations occur in a single codon and the significant biases
of error-prone PCR severely restrict the possible amino acid substitutions which may be
searched. Again, there is a need to reduce the sequence space which must be searched
in order to obtain desirable hybrids.

Family Shuffling (Recombination of Divergent Homologous Sequences)

Accumulating point mutations in a single sequence is an effective fine-tuning
mechanism for directed evolution, but other methods can also be used to create molecular
diversity, e.g. polymer sequences from which useful sequences can be identified by
screening or selection. Mutations can be produced in vitro using error-prone PCR
methods. Beneficial mutations can then be combined using genetic recombination
methods. For example, a parent (e.g. wild-type) can be mutated to create a mutant
library, which is then screened for desirable mutants. These mutants can then be used as
parent genes in recombination experiments. The mutant parents are cut into fragments
and the fragments are recombined to provide a library of recombinant mutants. The
recombinant mutants can then be screened for beneficial or improved properties.

Recombination can be done without mutagenesis of a common parent. For
example, two or more different but related parent genes can be recombined in a method
known as "family shuffling" or "DNA shuffling." Related sequences, e.g. from divergent
homologous genes, can be cut and recombined to make hybrid genes. These methods
generally rely on an assumption that the parent genes share closely related structures.
See, e.g., Stemmer, Nature, 370:389 (1994); Volkov, A.A., et al., Methods Enzymol.,
382:447-456 (2000); Crameri et al., Nature, 391:288 (1998). The shuffling process
creates a library of many new genes which code for proteins with sequence information
from any or all parents. For example, the first half of the sequence might come from one
parent, while the second half might come from another. Another hybrid might have the
first 20 nucleotides from one parent, the next 500 from another parent, and the last
nucleotides from a third parent. The point at which a sequence derived from one parent
switches to a sequence derived from another parent is called a crossover. There may be
one or more crossovers in a given sequence.
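
The assembly of such a hybrid can be sketched in a few lines. This is an illustrative sketch only, not a laboratory protocol. It assumes the parents have already been aligned to a common length, and the function name and arguments are hypothetical.

```python
def make_chimera(parents, crossovers, parent_order):
    """Assemble a hybrid sequence from aligned parents of equal length.

    parents      -- list of sequences (strings) of identical length
    crossovers   -- strictly increasing positions at which the source
                    parent switches (0 < position < sequence length)
    parent_order -- index into `parents` for each segment; one more entry
                    than there are crossovers
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to the same length")
    if len(parent_order) != len(crossovers) + 1:
        raise ValueError("need exactly one parent per segment")
    if list(crossovers) != sorted(set(crossovers)):
        raise ValueError("crossovers must be strictly increasing")
    if any(not 0 < c < length for c in crossovers):
        raise ValueError("crossovers must fall inside the sequence")

    # Segment boundaries: start of sequence, each crossover, end of sequence.
    bounds = [0, *crossovers, length]
    return "".join(
        parents[p][start:end]
        for p, start, end in zip(parent_order, bounds, bounds[1:])
    )
```

With three eight-residue parents, crossovers at positions 2 and 6 and a parent order of 0, 1, 2, the hybrid takes two residues from the first parent, four from the second and two from the third.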

A library of such hybrid genes might contain millions or trillions of different
genes containing different patterns of crossovers. In family shuffling, genes from
multiple parents and even from different species can be recombined, operations that do
not occur in nature but which may nonetheless be useful for rapid adaptation. DNA
shuffling is being used to generate improved proteins, and notably, proteins with features
not present in one or all parent proteins, or not even known to occur in nature. See,
Affholter & Arnold, "Engineering a revolution," Chemistry in Britain, 35: 48-51 (1999);
Ness et al., "Molecular Breeding - the natural approach to enzyme design," Advances in
Protein Chemistry, 55:261-292 (2000); Schmidt-Dannert, et al., "Molecular breeding of
carotenoid biosynthetic pathways," Nature Biotechnology, 18, 750-753 (2000).

DNA shuffling methods rely on hybridization between portions of the parent
genes and can therefore only recombine closely related sequences, usually of more than
70% sequence identity. Furthermore, these methods generate crossovers between one
parent sequence and another only in regions of the gene where there is high identity
between the two sequences. Stated another way, recombination based on DNA sequence
similarity requires overlap in the DNA between parents for a crossover to occur. The
DNA of the parents is fragmented, and in order for the fragments to reanneal, they need
to share some overlap to allow for DNA hybridization. The StEP protocol does not
require as much overlap as the DNA shuffling protocol originally proposed by Stemmer.
A variety of other shuffling techniques are also known, some of which do not require
sequence identity or alignments. These include for example the ITCHY protocol.
Ostermeier et al., Bioorganic & Medicinal Chem. 7:2139-2144 (1999); Ostermeier et al.,
Nature Biotechnol. 17:1205-1209 (1999).

Many proteins having similar three-dimensional structures show low or even no
discernible sequence identity or similarity. Rational design (Mitra et al., Biochemistry,
32:12959-12967 (1993); Shimoji et al., Biochemistry, 37:8848-8852 (1998)),
computational approaches (Bogarad & Deem, Proc. Natl. Acad. Sci. USA,
96:2591-95 (1999)), and combinatorial methods (Ostermeier et al., Nature Biotechnology,
17:1205-1209 (1999)) have shown that functional proteins can be obtained by
recombination of such distantly related or low sequence similarity parent sequences.
Accordingly there is a need for methods that can provide stable and functional hybrids
from recombined parents having low or no sequence similarity or identity, but having
three-dimensional structures in common.

Recombination can be performed using so-called "non-homologous" methods that
do not need sequence identity or overlap, because the experimental protocol relies on
other properties, and does not require DNA hybridization between the parents. Generally,
two parents are recombined with a single crossover point using such methods. If
recombination is restricted to a single crossover point between two parents, the crossover
disruption of the recombinant mutants may be very substantially increased, leading to a
library of less-stable mutants. According to the invention, non-homologous
recombination protocols can be modeled or used together with improved and targeted
computational methods to calculate crossover disruption profiles. These can be applied
to favorably restrict crossover locations, minimize disruption, and select crossover
regions and mutants that are more likely to be stable, and/or exhibit improved fitness.
Functional Crossover Locations

Random selection of crossover sites, as in conventional family shuffling, does not
favor sites that are more likely to produce functional and improved mutants.
Accordingly, methods of selecting promising crossover sites are needed. It has been
empirically observed that functional shuffled sequences do not contain an even
distribution of crossover locations throughout the sequence. For example, the crossover
locations of some in vitro recombinant mutants are strongly biased towards the N- and
C-termini of the resulting functional proteins. Ness et al., Nature Biotechnology, 1999,
17:893-896. Many of these crossovers at the termini do not, however, lead to functional
improvements.

Sequence Databases 

Given the explosive growth in the gene databases due to the exhaustive
sequencing of large numbers of organisms, the sequences of homologous genes are easily
accessible. However, to date, there is no rigorous method in the art to quantitatively use
the information in sequence databases to identify optimal starting parents for
recombination (e.g. shuffling) experiments. A method to rapidly and quantitatively use
such information is desirable. It is further desirable to have methods that predict where
crossover locations in recombination experiments are likely to generate functional
proteins which also may have new and useful properties. Such methods would be useful
for the creation of more diversity in a recombinant library, with a reduction in the
numbers of mutants needed to be produced and screened. Methods that would address
these and/or other problems in the art would allow the acceleration of in vitro protein
evolution and would accelerate the creation of new proteins (e.g. enzymes) with novel
and useful properties. This is of particular interest to those interested in improved
protein-based drugs, and in the use of enzymes in industrial processes where enzymes
must function in non-native environments or must catalyze non-native chemical
reactions.

Thus, there is presently a need in the art for improved methods of designing
biopolymers such as proteins and nucleic acids. Moreover, there exists a need for better
methods for improving one or more properties of a biopolymer. There further exists a
need for improved methods of directed evolution that overcome, at least partially, any
one or more of the above-described problems in the art. For example, there is a need in
the art to identify regions in the sequence of a molecule (e.g., a biopolymer such as a
protein or nucleic acid) where crossover recombination is likely to generate a library of
stable mutants or chimeras that can be screened for one or more beneficial and/or
improved properties.

3. SUMMARY OF THE INVENTION 

Applicants have discovered that producing mutant biopolymers by crossover
recombination at certain cut points or locations is more likely to preserve stability and/or
a desired property of the polymer, such as functionality, than crossovers in other areas.
The crossover locations are identified by examining at which locations a crossover disrupts
a schema or structural domain, or disrupts a minimum of coupling interactions between amino acid
side chains of the polymer (e.g. polypeptide). The invention provides novel techniques
for identifying residue locations where crossovers would disrupt a minimum of schema
or coupling interactions in a polypeptide. These methods are straightforward and are
computationally tractable.
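
One simple way to picture such an analysis is to count, for every possible cut point, how many interacting residue pairs would be split between two parents. The sketch below is an illustrative simplification and not the method of the invention. It assumes the coupling interactions are supplied as a list of residue index pairs (for example, pairs whose interatomic distance falls below some cutoff), it considers only a single crossover, and all names are hypothetical.

```python
def disruption_profile(length, contacts):
    """Count the interacting pairs split by a single crossover at each cut.

    A cut at position k means residues 0..k-1 come from one parent and
    residues k..length-1 from another, so a pair (i, j) is counted as
    disrupted when the smaller index is below k and the larger is not.
    """
    profile = {}
    for cut in range(1, length):
        profile[cut] = sum(
            1 for i, j in contacts if min(i, j) < cut <= max(i, j)
        )
    return profile


def candidate_crossovers(profile, max_broken):
    """Cut points that split no more than `max_broken` interacting pairs."""
    return [cut for cut, broken in sorted(profile.items()) if broken <= max_broken]
```

For a six-residue toy polymer in which residues 0 to 2 interact only with each other and residues 3 to 5 interact only with each other, the only cut that splits no pair is the one between the two groups, at position 3.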

Accordingly, a skilled artisan can readily use the methods to identify residues of
a particular polymer sequence that permit crossover recombination with minimal
disruption. The artisan may selectively recombine polymers at the identified crossover
locations to generate recombinant mutants that are likely to be functional, and which can
be screened for properties of interest. Such mutants are more likely to have one or more
properties of interest that are improved over the properties of the parent polymer. Thus,
by selectively recombining parent genes at identified crossover locations, e.g. in silico,
a skilled artisan may more readily and efficiently identify novel sequences with improved
properties than if the artisan used randomized methods or conventional shuffling.

The invention therefore provides methods for selecting residues of a biopolymer
sequence for crossover recombination by obtaining or determining which locations
disrupt a structural domain or a minimal amount of coupling interactions in the amino
acid sequence, and selecting the identified crossover locations. The polymers may be any
type of polymer, including biopolymers such as, but not limited to, nucleic acids
(comprising a sequence of nucleotide residues) and proteins or polypeptides (comprising
a sequence of amino acid residues).

The invention also provides methods for the directed evolution of biopolymers. 
Two or more parent sequences are provided, each for example having one or more 
properties of interest, and one or more possible crossover locations. One or more 
recombinant polymers may then be generated from the parent polymer sequences, in 
which two or more of the parents are recombined at one or more selected crossover 
locations. These mutants are preferably screened for the one or more properties of 
interest. Mutants are selected where one or more properties of interest is modified and 
preferably is improved. In certain embodiments, the methods of the invention are 
iteratively repeated, and selected mutants are used as parent polymer sequences in 
subsequent iterations of the method. 
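
By way of non-limiting illustration, the iterative method described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function names, the `score` callback standing in for a laboratory screen, and the library size and number of rounds are all hypothetical.

```python
import random

def recombine(parents, crossover_sites, rng):
    """Build one offspring, switching parents only at the allowed crossover sites."""
    current = rng.randrange(len(parents))
    child = []
    for pos in range(len(parents[0])):
        if pos in crossover_sites:
            # A crossover may occur here; the same parent may be chosen again.
            current = rng.randrange(len(parents))
        child.append(parents[current][pos])
    return "".join(child)

def directed_evolution(parents, crossover_sites, score,
                       rounds=3, library_size=50, keep=2, seed=0):
    """Recombine, screen with `score` (hypothetical assay), select, and iterate."""
    rng = random.Random(seed)
    sites = set(crossover_sites)
    for _ in range(rounds):
        library = {recombine(parents, sites, rng) for _ in range(library_size)}
        # Parents stay in the pool, so the best score never decreases.
        pool = sorted(library | set(parents), key=score, reverse=True)
        parents = pool[:keep]  # selected mutants become the next parents
    return parents
```

In this sketch all parent sequences are assumed to be aligned and of equal length, and the selected mutants of one round are passed directly to the next round as parents.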

The invention can also be used to identify optimal parent molecules (e.g., preferred 
parent genes) for recombination. Similar or structurally related parent molecules can be 
evaluated to determine which are more likely, when altered, to produce desirable 
improvements. For example, optimal parents can be mined from sequence databases, 
e.g., using disruption energy as a measure. 

Computer systems are also provided that may be used to implement the analytical 
methods of the invention, including methods of identifying crossover locations in a 
polymer sequence and/or selecting such residues for mutation (e.g., as part of a directed 
evolution method). These computer systems comprise a processor interconnected with 
a memory that contains one or more software components. In particular, the one or more 
software components include programs that cause the processor to implement steps of the 
analytical methods described herein. The software components may further comprise 
additional programs and/or files including, for example, sequence or structural databases 
of polymers. 

Computer program products are further provided, which comprise a computer 
readable medium, such as one or more floppy disks, compact discs (e.g., CD-ROMs or 
CD-RWs), DVDs, data tapes, etc., that have one or more software components encoded 
thereon in computer readable form. In particular, the software components may be 
loaded into the memory of a computer system and may then cause a processor of the 
computer system to execute steps of the analytical methods described herein. The 
software components may include additional programs and/or files including databases, 
e.g., of polymer sequences and/or structures. 

4. BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a flow diagram illustrating exemplary recombination embodiments of 
the methods of the invention. FIG. 1A illustrates a method for determining a schema 
disruption profile. FIG. 1B illustrates a method for modeling an experimental 
recombinant protocol. 

FIG. 2 is a schematic illustration and graphical representation of crossover 
disruption. 

FIG. 3 is a gene alignment for β-lactamase-like genes, (1) Enterobacter cloacae, 
SEQ. ID NO. 1; (2) Citrobacter freundii, SEQ. ID NO. 2; (3) Yersinia enterocolitica, 
SEQ. ID NO. 3; and (4) Klebsiella pneumoniae, SEQ. ID NO. 4. SWISS-PROT or 
TrEMBL accession numbers for the protein sequences and GenBank accession numbers 
for the DNA sequences are given. 

FIG. 4A is an in silico probability distribution for all crossover locations 
calculated from a recombination algorithm for the four β-lactamase sequences of FIG. 
3. 

FIG. 4B is an in silico probability distribution of crossover locations for 
β-lactamase when screened for crossover locations that meet a set threshold. In this 
example, recombinant mutants are below the threshold E_c = 14. The dark horizontal bars 
on the x-axis indicate the crossovers observed in a prior in vitro experiment (Crameri et 
al., Nature, 391:288 (1998)). These curves were calculated using Method 1 of the 
invention, described below. 

FIGS. 4C and 4D are similar to FIGS. 4A and 4B, but were calculated using 
Method 2 of the invention, described below. 

FIG. 5 is a crossover disruption plot for non-homologous recombination 
experiments, using the ITCHY protocol, with glycinamide ribonucleotide transformylase. 
The sequence range 50-100, where recombinations were restricted in the experiments, is 
shown on the x-axis. The crossover disruption is shown on the y-axis. 

FIG. 6 shows a probability distribution for schema disruption in computationally 
generated recombinant mutants. The probability distribution of the schema disruption 
is plotted for the recombinant mutants that contain at least three parents and is 
normalized by the total number of mutants. Each distribution represents the schema 
disruption of the portion of the recombinant mutants that contain each parent sequence: 
(1) Enterobacter cloacae, (2) Citrobacter freundii, (3) Yersinia enterocolitica, and (4) 
Klebsiella pneumoniae. The portion of the distribution that corresponds to low 
schema disruption is to the left of the black line (schema disruption < 18). In this 
region, the Klebsiella pneumoniae (4) sequence corresponds with the least-disruptive 
schema. The addition of the Yersinia enterocolitica (3) sequence causes the most schema 
disruption, explaining why it was not observed in the functional hybrid proteins found 
in DNA shuffling experiments. The inset bar graph shows the integral between the 
schema disruption cutoff and zero. This represents the fraction of low-disruption schema 
associated with each parent. 
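
As a hedged illustration of the inset calculation described for FIG. 6, the fraction of low-disruption mutants associated with each parent might be tallied as in the sketch below. The data layout (a list of pairs giving the parents used and the schema disruption of each mutant) and the function name are assumptions made only for this example.

```python
def low_disruption_fraction(mutants, cutoff):
    """Fraction of all mutants that contain a given parent and fall below the cutoff.

    mutants: list of (parents_used, disruption) pairs, where parents_used is an
             iterable of parent labels and disruption is a number.
    """
    total = len(mutants)
    if total == 0:
        return {}
    counts = {}
    for parents_used, disruption in mutants:
        if disruption < cutoff:
            for parent in set(parents_used):
                counts[parent] = counts.get(parent, 0) + 1
    # Normalize by the total number of mutants, as in the plotted distributions.
    return {parent: n / total for parent, n in counts.items()}
```

A parent whose fraction is small contributes mostly high-disruption mutants and, on this measure, would be a less attractive choice for recombination.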

FIG. 7 is an example of an in vitro method of overlap extension reassembly, 
targeting identified crossover locations. The appropriate fragments may be obtained by 
split-pool synthesis. In FIG. 7, part (A), all possible recombinants are prepared by 
crossover at positions 1 and 2. In FIG. 7, part (B), the recombinants can be prepared by 
assembly of synthetic fragments containing the crossover positions. This example 
requires fragments (plus-end primers). 

FIG. 8, part (A), shows a fragment reassembly method using a parental template. 
The synthetic fragments are extended against a parent template strand and the gaps are 
repaired. In FIG. 8, part (B), the resulting products are subjected to heteroduplex 
recombination (Volkov et al., Nucl. Acids Res., 27:18 (1999)) to create libraries of genes 
within regions of non-identity. More complexity can be introduced by the addition of 
more fragments during template assembly. 

FIG. 9 shows the preparation of gene fragments by PCR with primers 
directed to regions targeted for crossovers. In FIG. 9, part (A), the fragments are 
prepared by PCR with primers. The PCR reactions are performed with primers 1+2, 3+4 
and 5+6. The method is repeated for the other parents. 

FIG. 10 shows recombination directed to specific sites using crossover primers 
in DNA shuffling. In FIG. 10, part (A), crossover primers designed to have crossovers 
at designated positions (2 primers for each position) are prepared. In FIG. 10, part (B), 
the parent genes are fragmented and reassembled, utilizing PCR methods, in the presence 
of the crossover primers to promote recombination at designated positions. 

FIG. 11 shows an exemplary computer system that may be used to implement 
analytical methods of the invention. 

FIG. 12 is a flow diagram illustrating one embodiment of a recombinant search 
algorithm of the invention, based on sequence identity. In FIG. 12, part (1), the parent 
sequences are aligned with the template structure. In FIG. 12, part (2), all possible 
crossover points are determined according to a sequence identity algorithm. In FIG. 12, 
part (3), the coupling matrix is calculated. In FIG. 12, part (4), a start parent is picked 
at random and copied to the offspring until a possible cut point is reached. In FIG. 12, 
part (5), a random number is picked, and if the number is less than p, a random new 
parent is copied until the next cut point is reached. In FIG. 12, part (6), the crossover 
disruption of the offspring gene is determined. 
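
Parts (4) through (6) of FIG. 12 might be sketched as follows. This is a minimal sketch under stated assumptions: the cut points and the list of coupled residue pairs are taken as given inputs, "a random new parent" is read as a parent different from the current one, and the disruption is simply the count of coupled pairs inherited from different parents, which is an assumed stand-in for the disruption equations of the specification. The function names are hypothetical.

```python
import random

def make_offspring(parents, cut_points, p, rng):
    """Copy a random start parent, switching parent at a cut point with probability p.

    Returns (sequence, origin), where origin[i] is the index of the parent
    that contributed residue i. Requires at least two aligned parents.
    """
    cuts = set(cut_points)
    current = rng.randrange(len(parents))
    seq, origin = [], []
    for i in range(len(parents[0])):
        if i in cuts and rng.random() < p:
            # Assumption: the new parent differs from the current one.
            others = [k for k in range(len(parents)) if k != current]
            current = rng.choice(others)
        seq.append(parents[current][i])
        origin.append(current)
    return "".join(seq), origin

def crossover_disruption(origin, coupled_pairs):
    """Count coupled residue pairs whose members come from different parents."""
    return sum(1 for i, j in coupled_pairs if origin[i] != origin[j])
```

Here a cut point at index i means that a switch, if it happens, takes effect starting with residue i.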

FIG. 13 is a diagrammatic illustration of a computational algorithm used to 
generate recombinant mutants by DNA shuffling. (A) First, cut points are distributed 
randomly across the gene with probability p_c. In this diagram, the arrows mark cut points 
and the hatched lines represent regions of sequence similarity between parents. (B) A 
parent is picked at random to determine the first fragment. The next fragment is chosen 
amongst the parents that share adequate sequence identity (including the parent of the 
previous fragment) with equal probability. (C) The complete library of recombinant 
mutants that can be generated by the cut pattern shown. 
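
The shuffling model of FIG. 13, parts (A) and (B), might be sketched as below. The sketch is illustrative only. In particular, "adequate sequence identity" is modeled here as exact identity with the current parent over a small window around the cut point; that criterion, the `window` parameter and the function names are assumptions of this example, not definitions taken from the specification.

```python
import random

def random_cuts(length, p_c, rng):
    """Place a cut before each interior position independently with probability p_c."""
    return [i for i in range(1, length) if rng.random() < p_c]

def eligible_parents(parents, current, cut, window):
    """Parents identical to the current parent in a window around the cut.

    The current parent is always eligible, matching the statement that the
    parent of the previous fragment is included.
    """
    lo, hi = max(0, cut - window), min(len(parents[0]), cut + window)
    ref = parents[current][lo:hi]
    return [k for k, seq in enumerate(parents) if seq[lo:hi] == ref]

def shuffle_once(parents, p_c, window=2, seed=None):
    """Generate one recombinant from aligned, equal-length parent sequences."""
    rng = random.Random(seed)
    length = len(parents[0])
    current = rng.randrange(len(parents))
    pieces, start = [], 0
    for cut in random_cuts(length, p_c, rng) + [length]:
        pieces.append(parents[current][start:cut])
        if cut < length:
            # Each eligible parent is chosen with equal probability.
            current = rng.choice(eligible_parents(parents, current, cut, window))
        start = cut
    return "".join(pieces)
```

Parents that share no local identity never exchange fragments in this model, so each offspring is then a copy of a single parent.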
FIG. 14 is a flow chart of an exemplary algorithm for directed evolution 
experiments. 

FIG. 15 shows a quantitative comparison of the energy-based (x-axis) and 
distance-based (y-axis) calculations of crossover disruption for transformylase. An energy 
cutoff of 0.2 kcal/mol and a distance cutoff of 4.0 angstroms were used. The data fit a 
linear correlation with R^2 = 0.91. 

FIG. 16 shows a comparison of crossover disruption calculations for 
transformylase based on the distance (top) and energy (bottom) definitions of coupling. 
An energy cutoff of 0.2 kcal/mol and a distance cutoff of 4.0 angstroms were used. The 
qualitative shapes of both plots are similar. 

FIG. 17 shows the crossover disruption of inserted phytase domains. The 
distance cutoff d_c was set to 3.0 angstroms and the crossover disruption was normalized 
according to Equation (3). The experimental parameters are as reported by Lehmann and 
co-workers (2001). 

FIG. 18 is a schematic of the hierarchical process of protein folding. First, the 
unfolded polypeptide rapidly collapses ("bursts") into substructures. Next, the 
substructures condense to form the tertiary structure of the native protein. It is 
undesirable for crossovers to disrupt compact units that nucleate the remaining structure 
("building blocks" or "schema"). 

FIG. 19 is a schematic demonstrating the utility of a contact map in identifying 
compact units of substructure. A representative contact map is on the left. The graph on 
the right is a statistical study of the average length of contiguous residues that can fold 
into a sphere of the indicated diameter (Gilbert 1998). This information can be used in 
the following way. If a 15-residue segment can fold into a sphere with a diameter of 21 
angstroms, then this segment could be considered as being of average compactness. 
However, if a 20-residue segment can fold into a sphere of 21 angstroms, this is 
considered as having a significantly above-average compactness. This is visualized on 
the contact map as a triangle on the diagonal formed by the cut points required to 
generate the segment. If the segment fits into a sphere of the specified diameter, then the 
triangle will be entirely white (interacting). The contact map shows residues that are 
distant (black) and residues that are close (white). If a given segment folds an 
above-average number of residues into a given sphere size, then it is compact. 
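
The triangle test described for FIG. 19 might be sketched as follows, assuming the contact map is supplied as a square matrix of booleans in which `True` marks a white (close) residue pair. The function names are hypothetical, and the default average length of 15 residues is taken from the 21 angstrom example in the text above rather than from the underlying statistics.

```python
def triangle_is_white(contact, start, end):
    """True if every residue pair in the segment [start, end) is in contact."""
    return all(contact[i][j]
               for i in range(start, end)
               for j in range(i + 1, end))

def longest_compact_segment(contact, start):
    """Length of the longest segment beginning at `start` whose triangle is white."""
    n = len(contact)
    end = start + 1
    # Extend the segment while the next residue contacts every residue already in it.
    while end < n and all(contact[i][end] for i in range(start, end)):
        end += 1
    return end - start

def is_above_average(length, average_length=15):
    """Compare a segment length with the average length for the sphere diameter."""
    return length > average_length
```

A segment whose longest white triangle is well above the average length for the chosen diameter would be a candidate compact unit.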

FIG. 20 is a comparison of (A) the Go-algorithm (using a diameter size = 21 
angstroms) with (B) the 1D crossover disruption profile of transformylase. The Go- 
algorithm predicts that there are three domain-forming regions in the structure, whereas 
the 1D crossover disruption profile (threshold energy of 0.2 kcal/mol) demonstrates that 
one of these domain-forming regions is not sampled because it causes too much 
disruption. 

FIG. 21 is a two-dimensional contact map of beta-lactamase using a diameter of 21 
angstroms. Black regions indicate residues that are further than 21 angstroms apart and 
white regions indicate residues that are closer than 21 angstroms. The lines indicate the 
approximate locations of crossovers observed experimentally by Crameri et al. (1998). 
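
A contact map of the kind shown in FIG. 21 might be computed from residue coordinates as sketched below. The input format (one coordinate tuple per residue, for example an alpha-carbon position) and the function name are assumptions of this example; the text does not specify which atoms are used, and pairs at exactly the cutoff distance are treated here as close.

```python
import math

def contact_map(coords, d_max=21.0):
    """Boolean matrix: True (white) where two residues lie within d_max angstroms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) <= d_max for j in range(n)]
            for i in range(n)]
```

For example, three residues placed at 0, 10 and 30 angstroms along a line give a map in which only the first and last residues are marked as distant.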
FIG. 22 provides an analytical description of Go's algorithm for determining 
domains based on the contact map. The domain diameter was set to 21 angstroms for these 
calculations and Equation (8) is used to determine the domain-forming ability of each 
residue. Low regions in this graph indicate suitable places for domain boundaries. The 
thick black horizontal lines indicate the approximate domain boundaries identified by this 
method and the thin vertical lines demarcate the regions where crossovers were observed 
experimentally by Crameri et al. (1998). The domain algorithm identifies some of the 
general structure of where the crossover occurs, but makes a poor prediction overall. 

FIG. 23 shows an algorithm that combines the concept of disrupting a domain 
with the concept of disrupting coupling interactions. First, all fragments of size n_min and 
greater are identified in the structure. Next, the fragments that fold into a sphere of 
the domain diameter and are coupled to the remainder of the structure above a threshold 
disruption value are separated. Finally, the schema disruption value of all the residues 
involved in the interacting compact unit are incremented by one, indicating that 
crossovers that occur in this region will disrupt a "building block," and therefore be 
destabilizing. 
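
The counting step of the algorithm of FIG. 23 might be sketched as follows. This is a sketch under stated assumptions rather than the algorithm itself: compactness is tested with the contact-map criterion (all pairwise distances within the domain diameter), and the coupling test is left as a caller-supplied predicate, `is_schema`, because its exact form depends on equations not reproduced in this passage. All names are hypothetical.

```python
import math

def fits_in_sphere(coords, start, end, diameter):
    """Contact-map criterion: every pair in [start, end) lies within the diameter."""
    return all(math.dist(coords[i], coords[j]) <= diameter
               for i in range(start, end)
               for j in range(i + 1, end))

def schema_profile(coords, n_min, diameter, is_schema=lambda start, end: True):
    """Increment the profile of every residue in each qualifying compact fragment.

    is_schema(start, end) is a hypothetical hook for the coupling test;
    by default every compact fragment is counted.
    """
    n = len(coords)
    profile = [0] * n
    for start in range(n):
        for end in range(start + n_min, n + 1):
            if fits_in_sphere(coords, start, end, diameter) and is_schema(start, end):
                for i in range(start, end):
                    profile[i] += 1
    return profile
```

Residues with high counts lie inside many compact fragments, so minima of the resulting profile are the candidate crossover locations.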

FIG. 24 shows the schema disruption profile as determined from the 
transformylase structure. (A) No sequence identity was considered (P_i = P_j = 1 in 
Equation 3). The parameters are d_c = 4.0, E_cthresh = 4.0. (B) Sequence identity is 
considered (Equation 3). The parameters are d_c = 4.0 and E_cthresh = 1.0. The 
normalization of crossover disruption in both graphs was according to Equation (6). 

FIG. 25 shows the schema disruption profile as determined from the beta- 
lactamase structure compared with the experimentally observed crossover points (thick 
horizontal bars) (Crameri et al., 1998). (A) The profile as determined from the domain 
algorithm alone with a domain diameter of 21 angstroms and n_min = 15 residues 
(Equation 9). (B) The profile with disruptive domains removed, where the crossover 
disruption was normalized as in Equation (3). The crossover disruption threshold was set 
to E_cthresh = 0.007 (corresponding to a Z-score of 0.1). No sequence identity was 
considered (P_i = P_j = 1 in Equation 3). (C) The profile with disruptive domains removed, 
where the crossover disruption was normalized as in Equation (6). The crossover 
disruption threshold was set to E_cthresh = 4.0 (corresponding to a Z-score of 0.4). No 
sequence identity was considered (P_i = P_j = 1 in Equation 3). (D) The same profile as in 
(C), except sequence identity is considered (Equation 3). The crossover disruption 
threshold was set to E_cthresh = 0.6 (corresponding to a Z-score of 0.2). 

FIG. 26A shows a schema disruption calculation of the P450 2C5 structure. 
Equation (10) was used to generate the graph and the crossover disruption normalization 
scheme of Equation (3) was used. The parameters for this calculation are d_c = 4.0, 
E_cthresh = 0.005 (corresponding to a Z-score of 0.3). The red lines indicate where 
experimentally generated single cut point recombination events led to folded chimeras 
(Pikuleva et al., 1996). The arrow indicates the location of the crossover that resulted in 
a folded P450cam-P450 2C9 chimera (Shimoji et al., 1998). Note that not all of the 
residues were resolved in the structure, so the numbering starts at 30 (e.g., residue 1 in 
the graph is residue 30) and residues 212-222 are missing. 

FIG. 26B shows a schema disruption calculation of the P450cam structure. 
Equation (10) was used to generate the graph and the crossover disruption normalization 
scheme of Equation (3) was used. The parameters for this calculation are d_c = 4.0, 
E_cthresh = 0.007 (corresponding to a Z-score of 0.65). The red line indicates the location 
of the crossover that resulted in a folded P450cam-P450 2C9 chimera (Shimoji et al., 1998). 
Note that not all residues were resolved in the structure: residue 1 in the graph is residue 
7 in the structure. No sequence identity was considered for either P450 calculation 
(P_i = P_j = 1 in Equation 3). 

FIGS. 27A and 27B illustrate a method for determining optimal parents for 
crossover recombination by analyzing the schema disruption for a DNA 
shuffling experiment with beta-lactamase (Crameri et al., 1998). The parents in this 
example are: (1) Enterobacter cloacae, (2) Citrobacter freundii, (3) Yersinia 
enterocolitica, and (4) Klebsiella pneumoniae. 

FIG. 28 shows oligonucleotide fragments corresponding to the peptide schema 
for hybrid beta-lactamase proteins made by recombining TEM-1 and PSE-4 genes. 
Fragments are made by PCR amplification, where the primers at either end contain a 
short piece of DNA that overlaps with the preceding gene fragment. 

FIG. 29 shows schema calculations for the construction of hybrid beta-lactamase 
proteins with increasing disruption. 

FIG. 30 shows a schema disruption profile for hybrid recombination of beta- 
lactamase PSE-4 and TEM-1. This profile shows that in the case of recombining TEM-1 
and PSE-4, very few single crossovers will be acceptable. 

FIG. 31A illustrates a schema disruption. Black lines in the structure represent 
peptide bonds and the small dots are interactions between amino acid side chains. Two 
hybrid proteins are shown. When the last four residues come from one parent and the 
remaining residues come from the other parent, three interactions are disrupted. When 
the last eight residues come from the same parent, then there is no disruption. According 
to the schema approach of the invention, achieving folded hybrid proteins is more likely 
when the fewest interactions are disrupted. 
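
The counting illustrated in FIG. 31A might be sketched as follows. The contact list in the usage note is a hypothetical toy example chosen for illustration and is not the structure drawn in the figure; the function names are likewise assumptions.

```python
def disrupted_interactions(origin, contacts):
    """Count interacting residue pairs whose two residues come from different parents.

    origin[i] is the parent label of residue i; contacts is a list of (i, j) pairs.
    """
    return sum(1 for i, j in contacts if origin[i] != origin[j])

def single_crossover_profile(n, contacts):
    """Disruption for one crossover placed before residue k, for k = 1 .. n-1."""
    return [disrupted_interactions([0] * k + [1] * (n - k), contacts)
            for k in range(1, n)]
```

With eight residues and the toy contacts (0, 7), (1, 6), (2, 5) and (4, 7), taking the last four residues from a second parent disrupts three interactions, while taking all eight residues from one parent disrupts none. The single-crossover profile then shows which cut positions disrupt the fewest interactions.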

FIG. 31B shows the schema disruption profile of the structure in FIG. 31A, 
calculated using Equation 12 with a window size w = 6. 

FIG. 32 is a condensed view of the schema disruption profiles for (from top to 
bottom) cephalosporinase, subtilisin, cytochrome P450, and transformylase. The black 
regions indicate schema and the white regions mark minima in the schema disruption 
profile. 

FIG. 33 is a schema disruption profile of beta-lactamase TEM-1 with window 
size w = 15 and d_c = 5.0 angstroms. The shaded regions, also identified as regions 1-9, 
indicate the schema (Jelsch, C., Mourey, L., Masson, J. M., & Samama, J. P. (1993) 
Proteins 16, 364). 

FIG. 34 illustrates the activity of hybrid proteins as a function of their schema 
disruption. The upper dashed line is the wild-type activity and the lower dashed line is 
the MIC of XL1-Blue cells without beta-lactamase activity. As the disruption increases, 
there is a very sharp transition in the activity of the hybrid proteins and activity is lost 
around E_s = 60. The shading of each point indicates the parent of the first fragment: dark 
grey is PSE-4, at approximately 30, 40 and 100 on the x-axis; light grey is TEM-1, at 
approximately 35 and 60 on the x-axis; and black, at approximately 0 and 70 on the 
x-axis, is both. 

5. DETAILED DESCRIPTION OF THE INVENTION 

The invention overcomes problems in the prior art and provides novel methods 
which can be used for directed evolution of biopolymers such as proteins and nucleic 
acids. In particular, the invention provides methods which can be used to identify 
candidate locations in a biopolymer for crossovers, such that the biopolymer (e.g., 
polypeptide) will likely retain stability and functionality while allowing crossovers to 
occur. By generating hybrids that are recombined at selected candidate crossover 
locations or cut points, mutant or hybrid polymers having one or more improved 
properties may be more readily identified while simultaneously reducing the number(s) 
of mutants screened. 

Details of the invention are described below, including specific examples. These 
examples are provided to illustrate embodiments of the invention. However, the 
invention is not limited to the particular embodiments, and many modifications and 
variations of the invention will be apparent to those skilled in the art. Such modifications 
and variations are also part of the invention. 

5.1 Definitions 

The terms used in this specification generally have their ordinary meanings in the 
art, within the context of this invention and in the specific context where each term is 
used. Certain terms are discussed below, or elsewhere in the specification, to provide 
additional guidance to the practitioner in describing the compositions and methods of the 
invention and how to make and how to use them. The scope and meaning of any use of 
a term will be apparent from the specific context in which the term is used. 

Molecular Biology 

The term "molecule" means any distinct or distinguishable structural unit of 
matter comprising one or more atoms, and includes, for example, polypeptides and 
polynucleotides. 

The term "polymer" means any substance or compound that is composed of two 
or more building blocks ("mers") that are repetitively linked together. For example, a 
"dimer" is a compound in which two building blocks have been joined together; a "trimer" 
is a compound in which three building blocks have been joined together; etc. 

A "biopolymer" is any polymer having an organic or biochemical utility or that 
is produced by a cell. Preferred biopolymers include, but are not limited to, 
polynucleotides, polypeptides and polysaccharides. 

The term "polynucleotide" or "nucleic acid molecule" refers to a polymeric 
molecule having a backbone that supports bases capable of hydrogen bonding to typical 
polynucleotides, wherein the polymer backbone presents the bases in a manner to permit 
such hydrogen bonding in a specific fashion between the polymeric molecule and a 
typical polynucleotide (e.g., single-stranded DNA). Such bases are typically inosine, 
adenosine, guanosine, cytosine, uracil and thymidine. Polymeric molecules include 
"double stranded" and "single stranded" DNA and RNA, as well as backbone 
modifications thereof (for example, methylphosphonate linkages). 

Thus, a "polynucleotide" or "nucleic acid" sequence is a series of nucleotide bases 
(also called "nucleotides"), generally in DNA and RNA, and means any chain of two or 
more nucleotides. A nucleotide sequence frequently carries genetic information, 
including the information used by cellular machinery to make proteins and enzymes. The 
terms include genomic DNA, cDNA, RNA, any synthetic and genetically manipulated 
polynucleotide, and both sense and antisense polynucleotides. This includes single- and 
double-stranded molecules; i.e., DNA-DNA, DNA-RNA, and RNA-RNA hybrids as 
well as "protein nucleic acids" (PNA) formed by conjugating bases to an amino acid 
backbone. This also includes nucleic acids containing modified bases, for example, 
thio-uracil, thio-guanine and fluoro-uracil. 

The polynucleotides herein may be flanked by natural regulatory sequences, or 
may be associated with heterologous sequences, including promoters, enhancers, 
response elements, signal sequences, polyadenylation sequences, introns, 5'- and 
3'-non-coding regions and the like. The nucleic acids may also be modified by many 
means known in the art. Non-limiting examples of such modifications include 
methylation, "caps", substitution of one or more of the naturally occurring nucleotides 
with an analog, and internucleotide modifications such as, for example, those with 
uncharged linkages (e.g., methyl phosphonates, phosphotriesters, phosphoroamidates, 
carbamates, etc.) and with charged linkages (e.g., phosphorothioates, 
phosphorodithioates, etc.). Polynucleotides may contain one or more additional 
covalently linked moieties, such as proteins (e.g., nucleases, toxins, antibodies, signal 
peptides, poly-L-lysine, etc.), intercalators (e.g., acridine, psoralen, etc.), chelators (e.g., 
metals, radioactive metals, iron, oxidative metals, etc.) and alkylators, to name a few. The 
polynucleotides may be derivatized by formation of a methyl or ethyl phosphotriester or 
an alkyl phosphoramidite linkage. Furthermore, the polynucleotides herein may also be 
modified with a label capable of providing a detectable signal, either directly or 
indirectly. Exemplary labels include radioisotopes, fluorescent molecules, biotin and the 
like. Other non-limiting examples of modification which may be made are provided, 
below, in the description of the invention. 

The term "oligonucleotide" refers to a nucleic acid, generally of at least 10, 
preferably at least 15, and more preferably at least 20 nucleotides, preferably no more 
than 100 nucleotides, that is hybridizable to a genomic DNA molecule, a cDNA 
molecule, or an mRNA molecule encoding a gene, mRNA, cDNA, or other nucleic acid 
of interest. Oligonucleotides can be labeled, e.g., with 32P-nucleotides or nucleotides to 
which a label, such as biotin or a fluorescent dye (for example, Cy3 or Cy5) has been 
covalently conjugated. In one embodiment, an oligonucleotide can be used as a PCR 
primer. Oligonucleotides therefore have many practical uses that are well known in the 
art. For example, a labeled oligonucleotide can be used as a probe to detect the presence 
of a nucleic acid. Generally, oligonucleotides are prepared synthetically, preferably on 
a nucleic acid synthesizer. Accordingly, oligonucleotides can be prepared with 
non-naturally occurring phosphoester analog bonds, such as thioester bonds, etc. 

A "polypeptide" is a chain of chemical building blocks called amino acids that are 
linked together by chemical bonds called "peptide bonds". The term "protein" refers to 
polypeptides that contain the amino acid residues encoded by a gene or by a nucleic acid 
molecule (e.g., an mRNA or a cDNA) transcribed from that gene either directly or 
indirectly. Optionally, a protein may lack certain amino acid residues that are encoded 
by a gene or by an mRNA. For example, a gene or mRNA molecule may encode a 
sequence of amino acid residues on the N-terminus of a protein (i.e., a signal sequence) 
that is cleaved from, and therefore may not be part of, the final protein. A protein or 
polypeptide, including an enzyme, may be "native" or "wild-type", meaning that it 
occurs in nature; or it may be a "mutant", "variant" or "modified", meaning that it has 
been made, altered, derived, or is in some way different or changed from a native protein 
or from another mutant. 

"Amplification" of a polynucleotide denotes the use of polymerase chain reaction 
(PCR) to increase the concentration of a particular DNA sequence within a mixture of 
DNA sequences. For a description of PCR see Saiki et al., Science 1988, 239:487. 

A "gene" is a sequence of nucleotides which code for a functional "gene product". 
Generally, a gene product is a functional protein. However, a gene product can also be 
another type of molecule in a cell, such as an RNA (e.g., a tRNA or a rRNA). For the 
purposes of the invention, a gene product also refers to an mRNA sequence which may 
be found in a cell. For example, measuring gene expression levels according to the 
invention may correspond to measuring mRNA levels. A gene may also comprise 
regulatory (i.e., non-coding) sequences as well as coding sequences. Exemplary 
regulatory sequences include promoter sequences, which determine, for example, the 
conditions under which the gene is expressed. The transcribed region of the gene may 
also include untranslated regions including introns, a 5'-untranslated region (5'-UTR) and 
a 3'-untranslated region (3'-UTR). 

A "coding sequence" or a sequence "encoding" an expression product, such as a 
RNA, polypeptide, protein or enzyme, is a nucleotide sequence that, when expressed, 
results in the production of that RNA, polypeptide, protein or enzyme; i.e., the nucleotide 
sequence "encodes" that RNA or it encodes the amino acid sequence for that polypeptide, 
protein or enzyme. 

A "promoter sequence" is a DNA regulatory region capable of binding RNA 
polymerase in a cell and initiating transcription of a downstream (3' direction) coding 
sequence. A promoter sequence is typically bounded at its 3' terminus by the 
transcription initiation site and extends upstream (5' direction) to include the minimum 
number of bases or elements necessary to initiate transcription at levels detectable above 
background. Within the promoter sequence will be found a transcription initiation site 
(conveniently found, for example, by mapping with nuclease S1), as well as protein 
binding domains (consensus sequences) responsible for the binding of RNA polymerase. 

A coding sequence is "under the control of" or is "operatively associated with" 
transcriptional and translational control sequences in a cell when RNA polymerase 
transcribes the coding sequence into RNA, which is then trans-RNA spliced (if it contains 
introns) and, if the sequence encodes a protein, is translated into that protein. 

The terms "express" and "expression" mean allowing or causing the information 
in a gene or DNA sequence to become manifest, for example producing RNA (such as 
rRNA or mRNA) or a protein by activating the cellular functions involved in 
transcription and translation of a corresponding gene or DNA sequence. A DNA 
sequence is expressed by a cell to form an "expression product" such as an RNA (e.g., 
a mRNA or a rRNA) or a protein. The expression product itself, e.g., the resulting RNA 
or protein, may also be said to be "expressed" by the cell. 

The term "transfection" means the introduction of a foreign nucleic acid into a 
cell. The term "transformation" means the introduction of a "foreign" (i.e., extrinsic or 
extracellular) gene, DNA or RNA sequence into a host cell so that the host cell will 
express the introduced gene or sequence to produce a desired substance, in this invention 
typically an RNA coded by the introduced gene or sequence, but also a protein or an 
enzyme coded by the introduced gene or sequence. The introduced gene or sequence may 
also be called a "cloned" or "foreign" gene or sequence, and may include regulatory or 
control sequences (e.g., start, stop, promoter, signal, secretion or other sequences used by 
a cell's genetic machinery). The gene or sequence may include nonfunctional sequences or 
sequences with no known function. A host cell that receives and expresses introduced 
DNA or RNA has been "transformed" and is a "transformant" or a "clone". The DNA or 
RNA introduced to a host cell can come from any source, including cells of the same 
genus or species as the host cell or cells of a different genus or species. 

The terms "vector", "cloning vector" and "expression vector" mean the vehicle 
by which a DNA or RNA sequence (e.g., a foreign gene) can be introduced into a host 
cell so as to transform the host and promote expression (e.g., transcription and 
translation) of the introduced sequence. Vectors may include plasmids, phages, viruses, 
etc. and are discussed in greater detail below. 

A "cassette" refers to a DNA coding sequence or segment of DNA that codes for 
an expression product that can be inserted into a vector at defined restriction sites. The 
cassette restriction sites are designed to ensure insertion of the cassette in the proper 
reading frame. Generally, foreign DNA is inserted at one or more restriction sites of the 
vector DNA, and then is carried by the vector into a host cell along with the transmissible 
vector DNA. A segment or sequence of DNA having inserted or added DNA, such as an 
expression vector, can also be called a "DNA construct". A common type of vector is 
a "plasmid", which generally is a self-contained molecule of double-stranded DNA, 
usually of bacterial origin, that can readily accept additional (foreign) DNA and which 
can readily be introduced into a suitable host cell. A large number of vectors, including 
plasmid and fungal vectors, have been described for replication and/or expression in a 
variety of eukaryotic and prokaryotic hosts. 

The term "host cell" means any cell of any organism that is selected, modified, 
transformed, grown or used or manipulated in any way for the production of a substance
by the cell. For example, a host cell may be one that is manipulated to express a
particular gene, a DNA or RNA sequence, a protein or an enzyme. Host cells can further
be used for screening or other assays that are described infra. Host cells may be cultured
in vitro or may be one or more cells in a non-human animal (e.g., a transgenic animal or a
transiently transfected animal).

The term "expression system" means a host cell and compatible vector under
suitable conditions, e.g. for the expression of a protein coded for by foreign DNA carried
by the vector and introduced to the host cell. Common expression systems include E. coli
host cells and plasmid vectors, insect host cells such as Sf9, Hi5 or S2 cells and
Baculovirus vectors, Drosophila cells (Schneider cells) and expression systems, fish cells
and expression systems (including, for example, RTH-149 cells from rainbow trout,
which are available from the American Type Culture Collection and have been assigned
the accession no. CRL-1710) and mammalian host cells and vectors.

The terms "mutant" and "mutation" mean any change in a particular polymer 
sequence (also sometimes referred to herein as a "parent sequence"). Mutations may
include, but are not limited to, changes in the nucleotide sequence of a nucleic acid
(including changes in the sequence of a gene), and also changes in the amino acid
sequence of a protein or polypeptide. Thus, in the invention these terms may refer to a
difference of even one residue (e.g. one nucleic or amino acid), but more typically refer
to recombined sequences that are substantially different from their parents. That is, a
"mutant" includes the offspring of recombined parent sequences, as by combining (for
example) genetic material from two parent genes. A mutant may also be referred to as
a "hybrid" or a "variant."

The term "chimera" is synonymous with "recombinant mutant" and refers to an 
offspring gene which contains genetic material from one or more parents. 

The methods of the invention may include steps of comparing parent sequences
to each other or a parent sequence to one or more mutants. Such comparisons typically
comprise alignments of polymer sequences, e.g., using sequence alignment programs
and/or algorithms that are well known in the art (for example, BLAST, FASTA and
MEGALIGN, to name a few). The skilled artisan can readily appreciate that, in such
alignments, where a mutation contains a residue insertion or deletion, the sequence
alignment will introduce a "gap" (typically represented by a dash, or "Δ") in the
polymer sequence not containing the inserted or deleted residue. Thus, for example, in
an embodiment where a mutation introduces a single amino acid deletion in a parent
sequence at amino acid residue i, an alignment of the parent and mutant polypeptide
sequences will introduce a gap in the mutant sequence that aligns with amino acid residue
i of the parent. In such embodiments, therefore, amino acid residue i in the mutant
sequence is preferably said to be a "gap" or "deletion".
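
As a minimal sketch of the gap convention described above, the following Python
fragment reports which parent residues align with a gap in a mutant. The function
name gap_positions, the use of "-" as the gap character and the example sequences
are assumptions made for illustration only; the alignment itself is taken as
given (e.g., as produced by any of the alignment programs named above).

```python
def gap_positions(parent_aln, mutant_aln, gap="-"):
    """Return 1-based parent residue numbers that align with a gap in the mutant.

    Both arguments are rows of an existing pairwise alignment, so they must
    have the same length. The names and the gap symbol are illustrative.
    """
    if len(parent_aln) != len(mutant_aln):
        raise ValueError("aligned sequences must have equal length")
    positions = []
    parent_index = 0  # number of parent residues (not gaps) seen so far
    for p, m in zip(parent_aln, mutant_aln):
        if p != gap:
            parent_index += 1
            if m == gap:
                positions.append(parent_index)
    return positions


# Hypothetical example: the mutant lacks parent residue 4.
print(gap_positions("MKTAYIAK", "MKT-YIAK"))
```

In this example residue i = 4 of the mutant would be said to be a "gap" or
"deletion" in the sense used above.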

The term "heterologous" refers to a combination of elements not naturally 
occurring. For example, chimeric RNA molecules may comprise an rRNA sequence and 
a heterologous RNA sequence which is not part of the rRNA sequence. In this context,
the heterologous RNA sequence refers to an RNA sequence that is not naturally located
within the ribosomal RNA sequence. Alternatively, the heterologous RNA sequence may 
be naturally located within the ribosomal RNA sequence, but is found at a location in the 
rRNA sequence where it does not naturally occur. As another example, heterologous 
DNA refers to DNA that is not naturally located in the cell, or in a chromosomal site of
the cell. Preferably, heterologous DNA includes a gene foreign to the cell. A
heterologous expression regulatory element is a regulatory element operatively associated
with a different gene than the one it is operatively associated with in nature.

The term "homologous" refers to the relationship between two biopolymers (e.g. 
polypeptides or oligonucleotides) that possess a common evolutionary origin. This 
includes, without limitation, proteins from superfamilies (e.g., the immunoglobulin
superfamily) in the same species of organism, as well as homologous proteins from
different species of organism (for example, myosin light chain polypeptide, etc.; see
Reeck et al., Cell 1987, 50:667). Such proteins (and their encoding nucleic acids) have
sequence homology, as reflected by their sequence similarity, or regions of sequence
similarity, however expressed. For example, "homology" can be expressed as sequence
similarity in terms of percent sequence identity or by the presence of specific residues or 
motifs and conserved positions. 

The terms "sequence similarity" and "sequence identity", in all their grammatical
forms, refer to the degree of identity or correspondence between nucleic acid or amino
acid sequences that may or may not share a common evolutionary origin (see Reeck et
al., supra). However, in common usage and in the instant application, the term
"homologous", particularly when modified with an adverb such as "highly", may refer to
sequence similarity and may or may not relate to a common evolutionary origin.
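
Percent sequence identity can be illustrated with a short Python sketch. The
convention used here (identical positions divided by the number of aligned,
non-gap positions) is one of several in common use and is an assumption of this
example, as are the function name and the example sequences.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned positions where neither row has a gap.

    The denominator convention is an assumption; other conventions divide by
    the alignment length or by the length of the shorter sequence.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped columns are skipped under this convention
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned sequences differing at one of six positions.
print(round(percent_identity("ACDEFG", "ACDQFG"), 1))
```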

The term "recombination", and variant spellings thereof, encompasses both
"homologous" and "non-homologous" recombination. In its most basic form,
recombination is the exchange of biopolymer fragments between two biopolymer
sequences. As defined in this invention, sequences may be recombined at the amino acid
or nucleic acid level.

The term "homologous recombination" refers to the exchange of biopolymer
fragments between two or more biopolymer sequences at locations where the sequences
exhibit regions of sequence homology. In more general biological terms, recombination
refers to the insertion of a modified or foreign DNA sequence contained by a first vector
into another DNA sequence contained in a second vector, or a chromosome of a cell. The
first vector targets a specific chromosomal site for homologous recombination. For
homologous recombination, the first vector will contain sufficiently long regions of
homology to sequences of the second vector or chromosome to allow complementary
binding and incorporation of DNA from the first vector into the DNA of the second
vector, or the chromosome.

According to the invention, the sequence similarity of biopolymers being
recombined can be high, low, or none, and indeed can range from less than 50% (e.g., 0%)
to as high as 100%. Where parent sequences are homologous, i.e. have some threshold
of sequence identity, alignments may be used to aid in the selection of cut points and
fragments for recombination. Alignments are also used for certain recombination
protocols, such as DNA shuffling, which can be modeled according to the invention.
However, other recombinations do not require alignments, such as the ITCHY protocol,
and these also can be modeled to calculate a schema disruption profile. A model of
non-homologous (non-sequence identity) recombination is illustrated by FIG. 1A and FIG.
5, discussed infra. Crossovers can be calculated for 0% sequence identity, as long as the
parents fold into the same (or similar) structures. Cut points are determined as in FIG.
2, which does not require or imply sequence identity.

The term "non-homologous recombination" refers to the exchange of biopolymer
fragments between two biopolymer sequences that are not homologous, or that do not
share sequence identity, for example according to a given threshold. As used herein,
non-homologous biopolymers, like homologous biopolymers, may or may not have a common
evolutionary origin, and in preferred embodiments they do have a common evolutionary
origin. However, non-homologous biopolymers, unlike homologous biopolymers, have
no sequence identity, or the sequence identity (if any) is less than a given minimum.

In certain embodiments of the invention, biopolymers or fragments thereof may
be selected for recombination based on any suitable energy or structural data, not
necessarily homology or sequence identity. For example, cut points or schema may be
selected based on structural input such as interatomic distances, without regard for
sequence identity. That is, the biopolymers may or may not have any, or a given degree,
of sequence identity. Optimal schema (and fragments) can be determined from this data
without regard for the recombination or shuffling protocol. In addition, alignment data
from homologous sequences or regions, if any, can be used as additional structural input
to further refine the selected schema and optimal fragments for recombination.

A nucleic acid molecule is "hybridizable" to another nucleic acid molecule, such
as a cDNA, genomic DNA, or RNA, when a single stranded form of the nucleic acid
molecule can anneal to the other nucleic acid molecule under the appropriate conditions
of temperature and solution ionic strength (see Sambrook et al., supra). The conditions
of temperature and ionic strength determine the "stringency" of the hybridization. For
preliminary screening for homologous nucleic acids, low stringency hybridization
conditions, corresponding to a Tm (melting temperature) of 55°C, can be used, e.g., 5x
SSC, 0.1% SDS, 0.25% milk, and no formamide; or 30% formamide, 5x SSC, 0.5%
SDS. Moderate stringency hybridization conditions correspond to a higher Tm, e.g.,
40% formamide, with 5x or 6x SSC. High stringency hybridization conditions
correspond to the highest Tm, e.g., 50% formamide, 5x or 6x SSC. SSC is 0.15 M
NaCl, 0.015 M Na-citrate. Hybridization requires that the two nucleic acids contain
complementary sequences, although depending on the stringency of the hybridization,
mismatches between bases are possible. The appropriate stringency for hybridizing
nucleic acids depends on the length of the nucleic acids and the degree of
complementation, variables well known in the art. The greater the degree of similarity
or homology between two nucleotide sequences, the greater the value of Tm for hybrids
of nucleic acids having those sequences. The relative stability (corresponding to higher
Tm) of nucleic acid hybridizations decreases in the following order: RNA:RNA,
DNA:RNA, DNA:DNA. For hybrids of greater than 100 nucleotides in length, equations
for calculating Tm have been derived (see Sambrook et al., supra, 9.50-9.51). For
hybridization with shorter nucleic acids, i.e., oligonucleotides, the position of mismatches
becomes more important, and the length of the oligonucleotide determines its specificity
(see Sambrook et al., supra, 11.7-11.8). A minimum length for a hybridizable nucleic
acid is at least about 10 nucleotides; preferably at least about 15 nucleotides; and more
preferably the length is at least about 20 nucleotides.

Unless specified, the term "standard hybridization conditions" refers to a Tm of
about 55°C, and utilizes conditions as set forth above. In a preferred embodiment, the
Tm is 60°C; in a more preferred embodiment, the Tm is 65°C. In a specific embodiment,
"high stringency" refers to hybridization and/or washing conditions at 68°C in 0.2x SSC,
at 42°C in 50% formamide, 4x SSC, or under conditions that afford levels of
hybridization equivalent to those observed under either of these two conditions.

Suitable hybridization conditions for oligonucleotides (e.g., for oligonucleotide
probes or primers) are typically somewhat different than for full-length nucleic acids
(e.g., full-length cDNA), because of the oligonucleotides' lower melting temperature.
Because the melting temperature of oligonucleotides will depend on the length of the
oligonucleotide sequences involved, suitable hybridization temperatures will vary
depending upon the oligonucleotide molecules used. Exemplary temperatures may be
37 °C (for 14-base oligonucleotides), 48 °C (for 17-base oligonucleotides), 55 °C (for
20-base oligonucleotides) and 60 °C (for 23-base oligonucleotides). Exemplary suitable
hybridization conditions for oligonucleotides include washing in 6x SSC/0.05% sodium
pyrophosphate, or other conditions that afford equivalent levels of hybridization.

The term "isolated" means that the referenced material is removed from the
environment in which it is normally found. Thus, an isolated biological material can be
free of cellular components, i.e., components of the cells in which the material is found
or produced. In the case of nucleic acid molecules, an isolated nucleic acid includes a
PCR product, an isolated mRNA, a cDNA, or a restriction fragment. In another
embodiment, an isolated nucleic acid is preferably excised from the chromosome in
which it may be found, and more preferably is no longer joined to non-regulatory,
non-coding regions, or to other genes, located upstream or downstream of the gene
contained by the isolated nucleic acid molecule when found in the chromosome. In yet
another embodiment, the isolated nucleic acid lacks one or more introns. Isolated nucleic
acid molecules include sequences inserted into plasmids, cosmids, artificial
chromosomes, and the like. Thus, in a specific embodiment, a recombinant nucleic acid
is an isolated nucleic acid. An isolated protein may be associated with other proteins or
nucleic acids, or both, with which it associates in the cell, or with cellular membranes if
it is a membrane-associated protein. An isolated organelle, cell, or tissue is removed from
the anatomical site in which it is found in an organism. An isolated material may be, but
need not be, purified.

The term "purified" refers to material that has been isolated under conditions that
reduce or eliminate the presence of unrelated materials, i.e., contaminants, including
native materials from which the material is obtained. For example, a purified protein is
preferably substantially free of other proteins or nucleic acids with which it is associated
in a cell; a purified nucleic acid molecule is preferably substantially free of proteins or
other unrelated nucleic acid molecules with which it can be found within a cell. The term
"substantially free" is used operationally, in the context of analytical testing of the
material. Preferably, purified material substantially free of contaminants is at least 50%
pure; more preferably, at least 90% pure, and more preferably still at least 99% pure.
Purity can be evaluated by chromatography, gel electrophoresis, immunoassay,
composition analysis, biological assay, and other methods known in the art.

Methods for purification are well-known in the art. For example, nucleic acids
can be purified by precipitation, chromatography (including preparative solid phase
chromatography, oligonucleotide hybridization, and triple helix chromatography),
ultracentrifugation, and other means. Polypeptides and proteins can be purified by
various methods including, without limitation, preparative disc-gel electrophoresis,
isoelectric focusing, HPLC, reversed-phase HPLC, gel filtration, ion exchange and
partition chromatography, precipitation and salting-out chromatography, extraction, and
countercurrent distribution. For some purposes, it is preferable to produce the
polypeptide in a recombinant system in which the protein contains an additional sequence
tag that facilitates purification, such as, but not limited to, a polyhistidine sequence, or
a sequence that specifically binds to an antibody, such as FLAG and GST. The
polypeptide can then be purified from a crude lysate of the host cell by chromatography
on an appropriate solid-phase matrix. Alternatively, antibodies produced against the
protein or against peptides derived therefrom can be used as purification reagents. Cells
can be purified by various techniques, including centrifugation, matrix separation (e.g.,
nylon wool separation), panning and other immunoselection techniques, depletion (e.g.,
complement depletion of contaminating cells), and cell sorting (e.g., fluorescence
activated cell sorting or FACS). Other purification methods are possible. A purified
material may contain less than about 50%, preferably less than about 75%, and most
preferably less than about 90%, of the cellular components with which it was originally
associated. The term "substantially pure" indicates the highest degree of purity which can be
achieved using conventional purification techniques known in the art.

In preferred embodiments, the terms "about" and "approximately" shall generally
mean an acceptable degree of error for the quantity measured given the nature or
precision of the measurements. Typical, exemplary degrees of error are within 20 percent
(%), preferably within 10%, and more preferably within 5% of a given value or range of
values. Alternatively, and particularly in biological systems, the terms "about" and
"approximately" may mean values that are within an order of magnitude, preferably
within 5-fold and more preferably within 2-fold of a given value. Numerical quantities
given herein are approximate unless stated otherwise, meaning that the term "about" or
"approximately" can be inferred when not expressly stated.

Molecular Physics

The term "sequence space" refers to the set of all possible sequences of residues
for a polymer having a specified length. Thus, for example, the sequence space for a
protein or polypeptide 300 amino acid residues in length is the group consisting of all
sequences of 300 amino acid residues, e.g. 20^300 ≈ 10^390 sequences of 300 amino acids.
Similarly, the sequence space of a nucleic acid 300 nucleotides in length is the group
consisting of all sequences of 300 nucleotides, etc.
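
The size of a sequence space can be checked with a few lines of Python. This is
a sketch for illustration only; the function names are hypothetical, and the
alphabet sizes (20 amino acids, 4 nucleotides) are the usual assumptions.

```python
import math


def sequence_space_size(alphabet_size, length):
    """Exact number of distinct sequences of the given length."""
    return alphabet_size ** length


def log10_sequence_space(alphabet_size, length):
    """Base-10 logarithm of the sequence space size (the order of magnitude)."""
    return length * math.log10(alphabet_size)


# Proteins of 300 residues: 20^300, on the order of 10^390.
print(round(log10_sequence_space(20, 300), 1))
# Nucleic acids of 300 nucleotides: 4^300.
print(round(log10_sequence_space(4, 300), 1))
```

Working with the logarithm avoids printing the exact integer, which for 20^300
has several hundred digits.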

"Conformational energy" refers generally to the energy associated with a 
particular "conformation", or three-dimensional structure, of a polymer, such as the
energy associated with the conformation of a particular protein or nucleic acid.
Interactions that tend to stabilize a macromolecule such as a polymer (e.g., a protein or
nucleic acid) have energies that are quantitatively represented in this specification as
negative energy values, whereas interactions that destabilize a polymer have positive
energy values. Thus, the conformational energy for any stable polymer is quantitatively
represented by a negative conformational energy value. Generally, the conformational
energy for a particular polymer will be related to that polymer's stability. In particular,
polymers and other macromolecules that have a lower (i.e., more negative)
conformational energy are typically more stable, e.g., at higher temperatures (i.e., they
have greater "thermal stability"). Accordingly, the conformational energy of a polymer
may also be referred to as the polymer's "stabilization energy".

Typically, the conformational energy is calculated using an energy "force-field"
that calculates or estimates the energy contribution from various interactions which
depend upon the conformation of a polymer. The force-field is comprised of terms that
include the conformational energy of the alpha-carbon backbone, side chain-backbone
interactions, and side chain-side chain interactions. Typically, interactions with the
backbone or side chain include terms for bond rotation, bond torsion, and bond length.
The backbone-side chain and side chain-side chain interactions include van der Waals
interactions, hydrogen-bonding, electrostatics and solvation terms. Electrostatic
interactions may include coulombic interactions, dipole interactions and quadrupole
interactions. Other similar terms may also be included. Force-fields that may be used
to determine the conformational energy for a polymer are well known in the art and
include the CHARMM (see, Brooks et al., J. Comp. Chem. 1983, 4:187-217; MacKerell
et al., in The Encyclopedia of Computational Chemistry, Vol. 1:271-277, John Wiley &
Sons, Chichester, 1998), AMBER (see, Cornell et al., J. Amer. Chem. Soc. 1995,
117:5179; Woods et al., J. Phys. Chem. 1995, 99:3832-3846; Weiner et al., J. Comp.
Chem. 1986, 7:230; and Weiner et al., J. Amer. Chem. Soc. 1984, 106:765) and
DREIDING (Mayo et al., J. Phys. Chem. 1990, 94:8897) force-fields, to name a few.

In a preferred implementation, the hydrogen bonding and electrostatics terms are
as described in Dahiyat & Mayo (Science 1997, 278:82). The force field can also be
described to include atomic conformational terms (bond angles, bond lengths, torsions),
as in other references. See e.g., Nielsen JE, Andersen KV, Honig B, Hooft RWW, Klebe
G, Vriend G, & Wade RC, "Improving macromolecular electrostatics calculations,"
Protein Engineering, 12: 657-662 (1999); Sitkoff D, Lockhart DJ, Sharp KA & Honig B,
"Calculation of electrostatic effects at the amino-terminus of an alpha-helix," Biophys.
J., 67: 2251-2260 (1994); Hendsch ZS, Tidor B, "Do salt bridges stabilize proteins - a
continuum electrostatic analysis," Protein Science, 3: 211-226 (1994); Schneider JP, Lear
JD, DeGrado WF, "A designed buried salt bridge in a heterodimeric coil," J. Am. Chem.
Soc., 119: 5742-5743 (1997); Sindelar CV, Hendsch ZS, Tidor B, "Effects of salt bridges
on protein structure and design," Protein Science, 7: 1898-1914 (1998). Solvation terms
could also be included. See e.g., Jackson SE, Moracci M, elMasry N, Johnson CM,
Fersht AR, "Effect of Cavity-Creating Mutations in the Hydrophobic Core of
Chymotrypsin Inhibitor 2," Biochemistry, 32: 11259-11269 (1993); Eisenberg D &
McLachlan AD, "Solvation Energy in Protein Folding and Binding," Nature, 319: 199-203
(1986); Street AG & Mayo SL, "Pairwise Calculation of Protein Solvent-Accessible
Surface Areas," Folding & Design, 3: 253-258 (1998); Eisenberg D & Wesson L,
"Atomic solvation parameters applied to molecular dynamics of proteins in solution,"
Protein Science, 1: 227-235 (1992); Gordon & Mayo, supra.

"Coupled residues" are residues in a polymer that interact, through any
mechanism. The interaction between the two residues is therefore referred to as a
"coupling interaction". Coupled residues generally contribute to polymer fitness through
the coupling interaction. Typically, the coupling interaction is a physical or chemical
interaction, such as an electrostatic interaction, a van der Waals interaction, a hydrogen
bonding interaction, or a combination thereof. As a result of the coupling interaction,
changing the identity of either residue will affect the fitness of the polymer, particularly
if the change disrupts the coupling interaction between the two residues. Coupling
interactions may also preferably be described by a distance parameter between residues
in a polymer. If the residues are within a certain cutoff distance, they are considered
interacting. This approach provides good results and can be computed relatively quickly.
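
The distance-based description of coupling interactions can be sketched in
Python as follows. The function name coupled_pairs, the use of one
representative coordinate per residue, the example coordinates and the cutoff
value are all assumptions made for illustration; the specification above does
not fix any of them.

```python
import math
from itertools import combinations


def coupled_pairs(coords, cutoff):
    """Return index pairs (i, j), i < j, of residues within the cutoff distance.

    coords holds one (x, y, z) point per residue; which atom or centroid the
    point represents is left open here and is an assumption of the caller.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            pairs.append((i, j))
    return pairs


# Hypothetical coordinates for four residues and a hypothetical cutoff.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
print(coupled_pairs(coords, cutoff=4.5))
```

The all-pairs loop is quadratic in the number of residues, which is usually
acceptable for single proteins; a spatial grid could be substituted for larger
systems.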

If a coupling interaction is considered disrupted by crossover recombination, a
"crossover disruption" (E_c) parameter for each mutant can be determined. The
"crossover disruption" (E_c) of a mutant is determined by the number of disrupted coupled
interactions caused by the crossover from one sequence to another. Coupled, pairwise
interactions between amino acids from different parent sequences are summed, while the
interactions within fragments and shared between fragments from the same parent are not
counted. Candidate or optimal crossover locations on genes correspond to locations that
permit recombination with minimal disruption of coupling interactions, e.g. without
disrupting parental clusters of favorably interacting DNA residues (building blocks or
schema) in the parental genes.
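
A minimal sketch of the crossover disruption count described above, assuming
that the coupled residue pairs are already known and that each residue of the
mutant is labeled with the parent it was inherited from. The function name, the
data layout and the example values are hypothetical.

```python
def crossover_disruption(contacts, parent_of):
    """Count coupled pairs whose two residues come from different parents.

    contacts  : iterable of (i, j) residue index pairs considered coupled
    parent_of : sequence giving, for each residue index, a parent label
    Pairs within one parent (same label) are not counted.
    """
    return sum(1 for i, j in contacts if parent_of[i] != parent_of[j])


# Hypothetical six-residue mutant: residues 0-2 from parent A, 3-5 from parent B.
contacts = [(0, 1), (1, 3), (2, 5), (4, 5)]
print(crossover_disruption(contacts, "AAABBB"))
```

Here the pairs (1, 3) and (2, 5) span the two parents and are counted, while
(0, 1) and (4, 5) lie within a single parent and are not.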

A "crossover disruption profile" is the crossover disruption that would result if 
a crossover occurred at a given residue (or each residue) of a biopolymer sequence. 
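
One simple way to tabulate such a profile, offered only as a sketch, is to
consider a single crossover at each possible position and count the coupled
pairs that straddle it. The single-crossover assumption, the function name and
the example contacts are all illustrative choices, not a definitive procedure.

```python
def crossover_disruption_profile(contacts, n_residues):
    """Disruption count for a single crossover placed before each residue.

    For cut position c (1 <= c < n_residues), residues 0..c-1 are taken from
    one parent and residues c..n_residues-1 from the other, so a coupled pair
    (i, j) is disrupted when it straddles the cut.
    """
    profile = []
    for cut in range(1, n_residues):
        broken = sum(1 for i, j in contacts if min(i, j) < cut <= max(i, j))
        profile.append(broken)
    return profile


contacts = [(0, 1), (1, 3), (2, 5), (4, 5)]  # hypothetical coupled pairs
profile = crossover_disruption_profile(contacts, n_residues=6)
print(profile)
# Cut positions (1-based index into the profile) with the least disruption.
best = [cut for cut, e in enumerate(profile, start=1) if e == min(profile)]
print(best)
```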

The term "crossover" refers to a recombination process in which an exchange of 
polymer sequences occurs between two linear polymer sequences, e.g. any point at which
the genetic material from two parents is switched in an offspring.

A "schema disruption" is the disruption of a set of residues that interact in a
collectively beneficial way. For example, it may be harmful to the recombinant mutant
sequences if the residues participating in a schema come from different parents. Schema
disruption is a combination of the disruption of independent structural elements
(domains) or structural elements that cause a breaking of coupling interactions. See e.g.,
Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press,
Ann Arbor, MI (1975).

Thus, schema are clusters of amino acids in the structure that interact in some
positive way. For example, they may interact through hydrogen-bonds to stabilize the
structure or they may interact to perform the catalytic function of a protein (enzyme).
When these clusters of interacting residues are separated by recombination (because some
come from one parent and others come from a different parent), this has a detrimental
effect on the protein, e.g. by destabilizing it, or making it non-functional. An objective
of the invention is to minimize and prevent schema disruption, e.g. by modeling the
recombination of parent fragments to preserve schema in the resulting mutants.
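
The idea that a schema is disrupted when its residues come from different
parents can be sketched as follows. The schema names, residue indices and the
function name are hypothetical, and treating a schema as disrupted whenever
more than one parent contributes to it is a simplifying assumption.

```python
def disrupted_schemas(schemas, parent_of):
    """Return the names of schemas whose residues span more than one parent.

    schemas   : dict mapping a schema name to the residue indices it contains
    parent_of : sequence giving, for each residue index, a parent label
    """
    disrupted = []
    for name, residues in schemas.items():
        parents = {parent_of[r] for r in residues}
        if len(parents) > 1:
            disrupted.append(name)
    return disrupted


# Hypothetical schemas in a six-residue mutant built from parents A and B.
schemas = {"helix_cap": [0, 1, 2], "active_site": [2, 4, 5]}
print(disrupted_schemas(schemas, "AAABBB"))
```

In this example the "helix_cap" cluster is inherited intact from parent A,
whereas the "active_site" cluster mixes residues from both parents.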

A "domain disruption" is the disruption of a compact structural domain or folding 
unit of a biopolymer, e.g. a protein.

Schema disruption and domain disruption may also be profiled, in a manner akin 
to crossover disruption profiles. 

The "crossover probability", which is also denoted herein by the symbol P_c, is the
probability that a crossover will occur between two given nucleic or amino acid
sequences (for example, between two homologous genes). Crossover probability is
related to the experimental average fragment size in recombination experiments, and is
a parameter that can be influenced or controlled in certain recombination protocols. For
example, crossover probability can be controlled in DNA shuffling according to the time
that parental templates are exposed to the DNA-cleaving DNase. In StEP
recombination, this is controlled by timing the annealing/extension cycles. The
relationship between fragment size f and crossover probability P_c can be expressed as:
P_c = (f - 1)/N, where N is either the number of amino acid residues (when calculating
recombinant mutants based on a protein sequence), or the number of nucleotides (when
calculating the recombinants based on the DNA sequence).

The terms "crossover location" and "cut-point" are synonymous. The term refers
to the location on a biopolymer sequence where recombination occurs. A cut point is a
specific position at which a polymer sequence is broken in recombination.

The term "crossover region" refers to the area surrounding the crossover location,
for example within a range of residues on either side of a cut point. In certain
experiments and recombination methods the precise location of a cut point is uncertain
or cannot be determined or experimentally resolved. For example, when two parents
share sequence identity, it may not be possible to determine from the sequence of the
recombinant offspring precisely where within an aligned or surrounding region the cut
point (crossover) occurred. The range of possible cut points, each of which could have
produced the observed recombination results, can be called the crossover region. Once
a region of sequence identity (a crossover region) has been identified, the specific
placement of the cut point is not critical.

The term "fitness" is used to denote the level or degree to which a particular
property or combination of properties for a polymer (e.g. a biopolymer such as a protein
or a nucleic acid) is optimized. In directed evolution methods of the invention, the fitness
of a polymer is preferably determined by properties which are identified for
improvement. For example, the fitness of a protein may refer to the protein's stability
(e.g. at different temperatures or in different solvents), its biological activity or efficiency
(e.g. catalytic function), its binding affinity or selectivity (e.g. enantioselectivity), its
solubility (e.g. in aqueous or organic solvent), and the like.

Fitness can be determined or evaluated experimentally or theoretically, e.g. 
computationally. Other examples of fitness properties include enantioselectivity, activity 
towards non-natural substrates, and alternative catalytic mechanisms. Coupling
interactions can be modeled as a way of evaluating or predicting fitness.

Preferably, the fitness is quantitated so that each polymer (e.g., each amino acid 
or nucleotide sequence) will have a particular "fitness value". For example, the fitness 
of a protein may be the rate at which the polymer catalyzes a particular chemical reaction, 
or the protein's binding affinity for a ligand. In another embodiment, the fitness of a
polymer refers to the conformational energy of the polymer and is calculated, e.g., using
any method known in the art.

Generally, the fitness of a polymer is quantitated so that the fitness value increases
as the property or combination of properties is optimized. For example, where the
thermal stability of a polymer is to be optimized (conformational energy is preferably
decreased), the fitness value may be the negative conformational energy, i.e., F = -E.
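
The sign convention F = -E can be illustrated with a short Python sketch that
ranks sequences by fitness. The sequence names and energy values are invented
for illustration, and no particular energy units or force-field are implied.

```python
def fitness_from_energy(energy):
    """Fitness as the negative conformational energy, F = -E."""
    return -energy


# Hypothetical conformational energies (arbitrary units); more negative
# values correspond to more stable polymers under the convention above.
energies = {"parent_1": -120.0, "chimera_a": -135.5, "chimera_b": -98.2}

# Rank from highest fitness (most stable) to lowest.
ranked = sorted(energies, key=lambda name: fitness_from_energy(energies[name]),
                reverse=True)
print(ranked)
```

With this convention, lowering the conformational energy raises the fitness
value, so an optimization that maximizes F also minimizes E.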

Such techniques are found in the following exemplary references: Brooks BR,
Bruccoleri RE, Olafson BD, States DJ, Swaminathan S & Karplus M, "CHARMM: A
Program for Macromolecular Energy, Minimization, and Dynamics Calculations," J.
Comp. Chem., 4: 187-217 (1983); Mayo SL, Olafson BD & Goddard WA,
"DREIDING: A Generic Force Field for Molecular Simulations," J. Phys. Chem., 94:
8897-8909 (1990); Pabo CO & Suchanek EG, "Computer-Aided Model-Building
Strategies for Protein Design," Biochemistry, 25: 5987-5991 (1986); Lazar GA,
Desjarlais JR & Handel TM, "De Novo Design of the Hydrophobic Core of Ubiquitin,"
Protein Science, 6: 1167-1178 (1997); Lee C & Levitt M, "Accurate Prediction of the
Stability and Activity Effects of Site-Directed Mutagenesis on a Protein Core," Nature,
352: 448-451 (1991); Colombo G & Merz KM, "Stability and Activity of Mesophilic
Subtilisin E and Its Thermophilic Homolog: Insights from Molecular Dynamics
Simulations," J. Am. Chem. Soc., 121: 6895-6903 (1999); Weiner SJ, Kollman PA, Case
DA, Singh UC, Ghio C, Alagona G, Profeta S Jr, Weiner P, "A new force field for
molecular mechanical simulation of nucleic acids and proteins," J. Am. Chem. Soc., 106:
765-784 (1984).

The term "fitness landscape" is used to describe the set of all fitness values 
belonging to all polymer sequences in a sequence space. Thus, for example, referring 
again to the sequence space for proteins 300 amino acid residues in length (i.e., the group 
consisting of all sequences of 300 amino acid residues), each polypeptide in the sequence
space will have a particular fitness value that may (at least in theory) be calculated or
measured (e.g., by screening each polypeptide to determine its fitness). The set of these
fitness values is therefore the fitness landscape of the sequence space for proteins 300
amino acid residues in length. In many embodiments fitness values may vary
considerably among individual sequences in a given sequence space. The fitness value
for a given sequence may be higher or lower than those of other, similar sequences in the
sequence space. Such fitness values are referred to as "local maxima" (or
"local optima") and "local minima", respectively. A fitness landscape is described
as "rugged" when it contains many local maxima and/or local minima in the fitness
values. In all representations of the fitness landscape, there is a "global optimum,"
representing the sequence with the highest fitness. If the highest fitness is degenerate
(multiple sequences have the same fitness), then more than a single sequence can be the
global optimum. An objective of directed evolution and computational design methods
is to generate sequences having fitness values greater than the fitness value(s) of the
starting (e.g. parent) sequence or sequences. In a preferred embodiment of the invention,
the directed evolution and computational design methods generate sequences having
fitness values as close to the global optimum as is possible.
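
The definitions of local maxima and global optimum can be illustrated with a minimal sketch in Python. It assumes a toy two-letter alphabet, sequences of length 3, made-up fitness values, and that "similar" sequences are those differing at exactly one position; none of these choices is specified by the invention.

```python
from itertools import product

ALPHABET = "AB"  # hypothetical two-letter alphabet (assumption)

def neighbors(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, old in enumerate(seq):
        for new in ALPHABET:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]

def local_maxima(landscape):
    """Sequences whose fitness is at least that of every one-mutation neighbor."""
    return sorted(s for s, f in landscape.items()
                  if all(f >= landscape[n] for n in neighbors(s)))

def global_optima(landscape):
    """All sequences sharing the highest fitness (more than one if degenerate)."""
    best = max(landscape.values())
    return sorted(s for s, f in landscape.items() if f == best)

# Made-up fitness values for all 2**3 sequences, in the order AAA, AAB, ..., BBB
TOY_FITNESS = [1.0, 3.0, 2.0, 0.5, 2.5, 1.5, 0.2, 4.0]
LANDSCAPE = {"".join(p): f
             for p, f in zip(product(ALPHABET, repeat=3), TOY_FITNESS)}
```

In this toy landscape four of the eight sequences are local maxima, which is the sense in which a landscape is called "rugged", while only one is the global optimum.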

The "fitness contribution" of a polymer residue refers to the level or extent f(i_a)
to which the residue i_a, having an identity a, contributes to the total fitness of the
polymer. Thus, for example, if changing or mutating a particular polymer residue will
greatly decrease the polymer's fitness, that residue is said to have a high fitness
contribution to the polymer. By contrast, typically some residues i_a in a polymer may
have a variety of possible identities a without affecting the polymer's fitness. Such
residues therefore have a low contribution to the polymer fitness.

5.2 General Methods 

In accordance with the invention, there may be employed conventional molecular
biology, microbiology and recombinant DNA techniques within the skill of the art. Such
techniques are explained fully in the literature. See, for example, Sambrook, Fritsch &
Maniatis, Molecular Cloning: A Laboratory Manual, Second Edition (1989) Cold Spring
Harbor Laboratory Press, Cold Spring Harbor, New York (referred to herein as
"Sambrook et al., 1989"); DNA Cloning: A Practical Approach, Volumes I and II (D.N.
Glover ed. 1985); Oligonucleotide Synthesis (M.J. Gait ed. 1984); Nucleic Acid
Hybridization (B.D. Hames & S.J. Higgins, eds. 1984); Animal Cell Culture (R.I.
Freshney, ed. 1986); Immobilized Cells and Enzymes (IRL Press, 1986); B.E. Perbal, A
Practical Guide to Molecular Cloning (1984); F.M. Ausubel et al. (eds.), Current
Protocols in Molecular Biology, John Wiley & Sons, Inc. (1994).

The invention pertains to a computational method for identifying cut points or
locations in proteins that will permit crossovers in in vitro recombination experiments,
while retaining structural stability (and consequently, desirable properties) in the
offspring hybrid proteins. The invention can be applied to protein sequences of any or
no sequence similarity. Sequence and tertiary structural information at the protein level
for at least one of the starting parental sequences is used to identify structural domains
or coupled residues and calculate their disruption.

5.3 Overview of Modeling Techniques 

Disruption Profiles. According to the invention, recombination modeling
calculations are applied to determine the disruption of a biopolymer fragment (e.g. a
schema disruption profile and/or a crossover disruption profile) relative to the remainder
of the structure. In other words, what structural changes produced by recombination are
compatible with a parent or a functional "starting" or "reference" structure? In this way,
recombinations (and recombinants) that are predicted to disrupt schema (or coupling
interactions) can be eliminated in favor of a smaller library of recombinants predicted to
preserve them. This library is more likely to contain offspring which retain essential
and/or beneficial properties (such as activity and stability) and can be searched for other
or improved properties relative to their parents. The techniques for determining
disruption profiles include: (a) calculation of crossover disruption, e.g. using
distance-based or energy-based criteria for coupling; (b) calculation of domains in the
protein structure; and (c) calculating the disruption (e.g. a disruption profile) based on a
crossover disruption, domain disruption, or both. A schema disruption based on a
combination of the domain and crossover disruption is preferred. Distance-based criteria
for crossover disruption of coupling are also preferred.

Recombination Modeling. Calculations are made to model possible parental 
fragments for recombination based on: (a) a requirement of sequence identity between 
parents (for sequence-identity-dependent experimental protocols, such as DNA 
shuffling); (b) a constraint on the number and location of crossovers (for example, the
ITCHY protocol allows a single cut point, which very substantially reduces the number
of possible fragments); (c) other specified constraints, e.g., exon shuffling; and/or (d) a
protocol without constraints (used to determine the optimal crossover).

5.4 Schema Disruption Model 

Interactions among residues of a biopolymer can be modeled as schema, which
in turn can be evaluated (e.g. in a schema disruption profile) to determine optimum
crossover locations for recombining two or more parent molecules. Schema can be based
on coupling interactions between residues, e.g. based on conformational energy and/or
interatomic distances. According to the invention, crossover locations that do not disrupt
coupling interactions or schema are preferred.

Principles of crossover disruption of coupling interactions according to the 
invention are illustrated in FIG. 2. A "Protein Z" having amino acid residues (shown as 
circles) at positions 1 through 12 is shown in cartoon form. In part A of FIG. 2, Protein 
Z is shown in a folded cartoon at the left, and in a two-dimensional representation of its
folded three-dimensional conformation at the right. These drawings indicate the relative
location or position in space of each residue with respect to the other residues. The black
line represents peptide bonds between the residues 1-12. The grey dotted lines represent
coupling interactions between amino acid side chains. For example, residue 3 is joined
to residues 2 and 4 by peptide bonds (solid lines). Residue 3 is coupled to residues 11
and 12 by coupling interactions (dotted lines), which may be associated with any
molecular forces other than the peptide bonds of the protein's primary structure.

The coupling interactions can be mapped to a coupling matrix, as shown for 
example in part B of FIG. 2. In this view of the matrix, the primary amino acid sequence
1-12 is shown in linear form, with each superimposed line indicating a coupling
interaction. The number of interactions affecting each residue is conveniently shown.
These lines also show which residues are coupled to each other. 

According to the invention it is desirable for recombination to minimize 
disruption of coupling interactions. This can be achieved, for example, by cutting the 
sequence for recombination at locations selected so that the least number of interactions
are separated onto different fragments. Desirable or optimum cut points can be identified
with the aid of a crossover interaction profile, or of a crossover disruption (E_c), as shown
graphically in part C of FIG. 2. The graph shows the crossover disruption, or the
number of coupling interactions that are broken (y-axis), for each residue of the protein
(x-axis), when a single cut is located before each residue. (A cut point can be named for
the residue it follows or precedes. In this example, each cut point occurs before the
named residue.) For example, if Protein Z is cut at residue 3 in a recombination
experiment, i.e. the cut is between residues 2 and 3, the resulting fragments for
recombination (e.g. from two different parent proteins) are a fragment 1-2 and a fragment
3-12. (Note that in the example recombination actually occurs at the genetic level, at a
corresponding cut point in a nucleotide sequence that encodes Protein Z.) Graph C
of FIG. 2, like diagram B, shows that for this hypothetical Protein Z, a cut point at
residue 3 will disrupt seven coupling interactions. Crossover disruption can be calculated
by computer, using known programming methods.
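
A single-cut crossover disruption profile of this kind can be sketched in Python as follows. The function name and the list of couplings are hypothetical; in particular, the couplings are made up for illustration and are not the actual couplings of Protein Z in FIG. 2.

```python
def crossover_disruption_profile(n_residues, couplings):
    """Map each cut residue k to the number of coupled pairs split by that cut.

    A cut "at residue k" falls between residues k-1 and k (1-based numbering),
    giving fragments 1..k-1 and k..n_residues, as in the naming convention of
    the text. A pair (i, j) is split when exactly one member lies before the cut.
    """
    profile = {}
    for k in range(2, n_residues + 1):
        profile[k] = sum(1 for i, j in couplings
                         if min(i, j) < k <= max(i, j))
    return profile

# Hypothetical couplings for a 12-residue chain (NOT the couplings of FIG. 2)
TOY_COUPLINGS = [(1, 8), (2, 7), (3, 11), (3, 12), (4, 6), (9, 12)]
PROFILE = crossover_disruption_profile(12, TOY_COUPLINGS)

# Cut residue with the fewest broken couplings (first one found on ties)
BEST_CUT = min(PROFILE, key=PROFILE.get)
```

For these made-up couplings the profile peaks for cuts near the middle of the chain and is lowest near a terminus, the same qualitative bias the text describes for Protein Z.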

For the simple structure of Protein Z, the graph shows that the crossover 
disruption is greatest if a cut is made in the center of the gene, e.g. at a nucleotide triplet 
or codon corresponding to one of the amino acid residues 4-8. According to the
invention, cut points are selected to minimize crossover disruption, so there is a bias in
this example toward selecting cut points at the ends or termini of Protein Z. For example,
a cut point at residue 11 (e.g. parent A donates residues 1-10 and parent B donates
residues 11-12) will produce mutants having less crossover disruption than a cut point
at residue 6 (parent A donates residues 1-5 and parent B donates residues 6-12). Mutants
with less crossover disruption are more likely to be functional and retain desirable
properties from one or both parents. When such parents are used in directed evolution
experiments, the probability of adding or improving desirable properties, without loss of
stability, utility or functionality, is increased. In this way the methods of the invention
are not random. Cut points for recombination are not obtained only as a random
consequence of known directed evolution methods, such as error-prone PCR, family
shuffling or StEP. Rather, more favorable or promising cut points are identified and
preselected according to an evaluation of coupling interactions and crossover disruption,
as illustrated in the coupling matrix of FIG. 2.

The invention is not limited to the use of a single cut point. More than one cut
point may be used to provide a plurality of fragments from two or more parents for
recombination. For example, two cut points can be selected for hypothetical Protein Z,
indicated by scissor icons in part B of FIG. 2. When the residues between these cut
points come from parent A (residues 4-7) and the terminal fragments come from parent
B (residues 1-3 and 8-12) the crossover disruption is reduced to zero. According to the
invention, these cut points and the resulting parental fragments would be preferred for
recombination experiments, e.g. where mutants obtained from such recombinations are
screened for desirable properties, including new or modified properties, or the loss or
reduction of one or more undesirable properties.
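
The two-cut-point case can be sketched as an exhaustive search in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the couplings are made up (they are not those of FIG. 2), and the middle fragment is taken from parent A with both terminal fragments from parent B, as in the example above.

```python
from itertools import combinations

def disruption(origin, couplings):
    """Count coupled pairs (1-based) whose residues come from different parents."""
    return sum(1 for i, j in couplings if origin[i - 1] != origin[j - 1])

def best_two_cuts(n_residues, couplings):
    """Try every pair of cut points a < b and return (disruption, a, b).

    Residues a..b-1 come from parent A; residues 1..a-1 and b..n_residues
    come from parent B. The first pair with the lowest disruption is kept.
    """
    best = None
    for a, b in combinations(range(2, n_residues + 1), 2):
        origin = ["A" if a <= r < b else "B"
                  for r in range(1, n_residues + 1)]
        score = disruption(origin, couplings)
        if best is None or score < best[0]:
            best = (score, a, b)
    return best

# Hypothetical couplings for a 12-residue chain (NOT the couplings of FIG. 2)
TOY_COUPLINGS = [(4, 6), (5, 7), (1, 8), (3, 11), (3, 12), (2, 9)]
RESULT = best_two_cuts(12, TOY_COUPLINGS)
```

With these couplings the search returns cuts at residues 4 and 8, i.e. residues 4-7 from parent A and residues 1-3 and 8-12 from parent B, with no coupling broken.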

Calculation of Coupling Interactions From A Crystal Structure

As shown by FIG. 1B, a structure file of a parent polymer is obtained, such as a
data file representing the three-dimensional structure of a gene or a protein. Databases
of this kind are known in the art. Coupling interactions between the building blocks of
the polymer are then identified from the structural data, using the methods described
herein. From the identified coupling interactions, structural domains, or compact units
of structure, can be identified and represented as a schema for the polymer. For example,
when the polymer is a protein, and the schema building blocks are amino acid residues,
the set of residues contributing to each domain of the three-dimensional protein structure
can be determined. Because a protein is folded, the residues which interact and
participate in a domain may not be, and often are not, adjacent to each other in the linear
or primary sequence of the protein. This is shown, for example, in the cartoon for "Protein
Z" in FIG. 2A, where amino acid residues 1 and 8 are close to each other in the
three-dimensional structure.

Domains, for example folding domains, can be identified by testing for residues
which interfere with structural stability, and which form groups of residues that are
considered essential or important to stability, based on threshold criteria as described
herein (e.g. conformational energy or atomic distance thresholds). Groups of residues
which, if altered, would significantly impair structural stability are identified as domains.
Crossover disruptions can be calculated for the residues, using the methods described
herein, to identify domains and generate a schema profile. See e.g., the accompanying
Examples, and especially Example 6.3, for domain identification, and schema and
crossover disruption based on distance criteria.

Once the domains and sets of interacting building blocks are identified and a
schema is determined, a crossover disruption E_c is determined for each domain. The
results for all domains of the polymer can be plotted as a schema disruption profile, as
described herein, and in a manner similar to a crossover disruption profile. To determine
the crossover disruption and generate a profile, a threshold disruption value is set. The
contribution of each residue of each domain to the structural integrity or fitness of the
polymer is evaluated, based on the degree to which it interacts with each other residue
of each other domain. This is compared to the threshold crossover disruption, which is
determined empirically or is modeled as a probability as described above for E_c in a DNA
shuffling recombination context. Domains which exhibit a low crossover disruption
compared to the threshold are "rejected", meaning they can be substituted without
disrupting the structure. Domains which exhibit a high crossover disruption are
"accepted", meaning that they are schema which should be preserved in the offspring.
This follows from the principles described above. Domains which are essential or
important to the structural integrity or shape of the polymer (which have a high crossover
disruption) should not be disrupted by recombination, in favor of crossovers in domains
that are less essential or important to the structural integrity or shape of the polymer (they
have a low crossover disruption). It should be noted, however, that the terms "accept" and
"reject" (FIG. 1B) are relative, and could be interchanged, depending on the desired point
of view. Thus, domains with a low crossover disruption could be "accepted" as
candidates for crossover recombination. Domains with a high crossover disruption would
be "rejected" for crossover recombination, so that those domains can be protected or
preserved.

The process of accepting and rejecting domains to generate a schema disruption
profile can be performed iteratively, until all residues of all domains are identified and
their relative contribution to the structure of the polymer is determined. When this is
"Done" (FIG. 1B), the data is used to mark all domains that are disruptive, so that they
will be preserved; crossover recombinations in these domains will not be modeled or
performed. From the remaining domains, optimal crossovers can be identified. These
are the sets of possible crossovers within the low disruption domains that are calculated
to perturb the polymer the least, while offering the best chances for new or improved
properties.

The last two steps of FIG. 1B are optional. If a recombination protocol is to be
used for directed evolution experiments, the protocol may have restrictions on the
crossover locations which are accessible to the method, or the number and manner in
which crossovers occur. Using a cut point or fragment file which identifies and
represents these restrictions, the sequence space of optimal crossovers from the previous
steps can be further limited or reduced, to those which also satisfy the restrictions of the
experimental protocol. For example, protocols based on homologous recombination,
sequence identity or alignments, e.g. as depicted in FIG. 1A and FIG. 12, may be used
in combination with the non-homologous methods described here and by reference to
FIG. 1B.

Conceptually, a set of possible parents is selected based on structural similarity.
In one embodiment, the parents can be identified based on regions of sequence identity.
Using the computational methods described herein, a set of all possible cut points for
these parents can be generated. These computations are independent of any constraints
on recombination, for example limitations which may be posed by particular protocols
for directed evolution. The set of optimum cut points can then be determined from the
set of all possible cut points, using the methods of the invention. More particularly, cut
points are selected to minimize the disruption of coupling interactions in the
three-dimensional structure of the protein. Recombination or evolution methods can then
be selected and adapted to cut and recombine the parents at the selected cut points.

In preferred methods of the invention, once the parental sequences are aligned
and candidate cut points identified, the structure or conformation of one of the parent
sequences is also obtained or otherwise provided (FIG. 1A). The preferred method of
the invention requires that the structure or conformation of a parental amino acid
sequence be obtained or otherwise provided. In many preferred embodiments, and
particularly in embodiments where the parent sequence is the sequence for a known
protein or nucleic acid, the structure or conformation of the parent sequence will be
known and can be obtained from any of a variety of resources (for a review, see Hogue
et al., Methods Biochem. Anal. 1998, 39:46-73). For example, and not by way of
limitation, the Protein Data Bank (PDB) (Berman et al., Nucl. Acids Res. 2000,
28:235-242) is a public repository of three-dimensional structures for a large number of
macromolecules, including the structures of many proteins, nucleic acids and other
biopolymers.

Alternatively, in many embodiments the structure of a polymer (e.g., protein)
sequence that is similar or homologous to the parent sequence will be known. In such
instances, it is expected that the conformation of the parent sequence will be similar to
the known structure of the homologous polymer. The known structure may, therefore,
be used as the structure for the parent sequence or, more preferably, may be used to
predict the structure of the parent sequence (i.e., in "homology modeling"). As a
particular example, the Molecular Modeling Database (MMDB) (see, Wang et al., Nucl.
Acids Res. 2000, 28:243-245; Marchler-Bauer et al., Nucl. Acids Res. 1999, 27:240-243)
provides search engines that may be used to identify proteins and/or nucleic acids that are
similar or homologous to a parent sequence (referred to as "neighboring" sequences in
the MMDB), including neighboring sequences whose three-dimensional structures are
known. The database further provides links to the known structures along with alignment
and visualization tools whereby the homologous and parent sequences may be compared
and a structure may be obtained for the parent sequence based on such sequence
alignments and known structures.

In other embodiments, where the structure for a particular parent sequence may
not be known or available, it is typically possible to determine the structure using routine
experimental techniques (for example, X-ray crystallography and Nuclear Magnetic
Resonance (NMR) spectroscopy) and without undue experimentation. See, e.g., NMR
of Macromolecules: A Practical Approach, G.C.K. Roberts, Ed., Oxford University
Press Inc., New York (1993). Alternatively, and in less preferable embodiments, the
three-dimensional structure of a parent sequence may be calculated from the sequence
itself and using ab initio molecular modeling techniques already known in the art.
Three-dimensional structures obtained from ab initio modeling are typically less reliable
than structures obtained using empirical (e.g., NMR spectroscopy or X-ray
crystallography) or semi-empirical (e.g., homology modeling) techniques. However, such
structures will generally be of sufficient quality, although less preferred, for use in the
methods of this invention.

Calculation of a Schema Disruption Profile 
Once the three-dimensional structure of one of the parental amino acid sequences is
determined, the method of the invention provides for the determination of coupling
interactions between pairs of amino acid side chains. In a preferred embodiment the
coupling interactions are represented by the use of a coupling matrix, as described infra.
A matrix can be presented diagrammatically, or its members can be described in
numerical or binary fashion. For example, if residues 3 and 8 of a structure are the only
coupled residues, then the (3,8) and (8,3) members or cells of the NxN matrix can be set
to 1, and all other cells are set to 0.
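
The binary form of the coupling matrix can be sketched in Python as follows. The function name is hypothetical, and N = 12 is an assumption chosen only for illustration; the text does not fix N for this example.

```python
def coupling_matrix(n_residues, coupled_pairs):
    """Build a symmetric N x N binary matrix from 1-based coupled residue pairs."""
    matrix = [[0] * n_residues for _ in range(n_residues)]
    for i, j in coupled_pairs:
        # Residue numbers are 1-based; list indices are 0-based.
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1
    return matrix

# Residues 3 and 8 are the only coupled residues, as in the example in the text
M = coupling_matrix(12, [(3, 8)])
```

Only the (3,8) and (8,3) cells are set to 1; every other cell remains 0.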

The coupling interactions can be defined by the determination of conformational
energy between residues, or based on distance parameters such as interatomic distances
(the distances between atoms in residues of the polymer). Calculations based on
distances are preferred. An energy or distance measure that is outside a certain threshold
between residues can be used to determine that the residues are considered to be
uncoupled. For example, in embodiments based on conformational energy or distance,
only those residues that exhibit a stabilization or conformational energy below a
defined threshold, or that lie within a threshold interaction distance, are considered to be
coupled. For example, in a preferred conformational energy embodiment, the threshold
was defined as 0.25 kcal/mol.
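
A distance-based coupling criterion can be sketched in Python as shown below. The function name, the coordinates, and the cutoff of 4.5 (in the same units as the coordinates) are all assumptions for illustration; this passage does not state a distance threshold, and the sketch does not exclude sequence-adjacent residues.

```python
import math

def coupled_by_distance(side_chain_atoms, cutoff):
    """Return coupled pairs (i, j), i < j, whose closest side-chain atoms lie
    within `cutoff` of each other.

    side_chain_atoms maps residue number -> list of (x, y, z) coordinates.
    """
    residues = sorted(side_chain_atoms)
    pairs = []
    for idx, i in enumerate(residues):
        for j in residues[idx + 1:]:
            closest = min(math.dist(a, b)
                          for a in side_chain_atoms[i]
                          for b in side_chain_atoms[j])
            if closest <= cutoff:
                pairs.append((i, j))
    return pairs

# Made-up coordinates for three residues (illustration only)
TOY_ATOMS = {
    1: [(0.0, 0.0, 0.0)],
    2: [(3.0, 0.0, 0.0), (3.0, 1.0, 0.0)],
    3: [(10.0, 0.0, 0.0)],
}
PAIRS = coupled_by_distance(TOY_ATOMS, cutoff=4.5)  # 4.5 is a hypothetical cutoff
```

The resulting pair list can be fed directly into a binary coupling matrix of the kind described above.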

5.5 Modeling Recombination Based on Fragment Restrictions 
According to the invention, recombination protocols that limit or restrict the
fragments which can recombine can be modeled, and optimal crossovers from a set or
subset of fragments can be determined.

Sequence Identity Based Recombination

In this example, the invention is used to model the recombination of DNA
sequences using methods that rely or depend on sequence identity. FIG. 1A provides a
flow diagram illustrating a general, exemplary embodiment of the methods used in this
invention. A skilled artisan can readily appreciate that certain steps may be omitted and
the order of the steps may be changed. In particular, the flow diagram in FIG. 1A as well
as other examples presented in Section 6, infra, describe preferred embodiments where
the methods were used in directed evolution of a protein or other polypeptide. Those
skilled in the art can readily appreciate, however, that the methods illustrated by these
examples and throughout this specification may be used to modify any polymer or
biopolymer, including any amino acid or nucleotide sequence, or any DNA or RNA
molecule.

Parent Sequences. 

The method shown in FIG. 1A begins with the selection of "parent" polymer
sequences. For example, the parent sequences may be any amino acid sequence and may
or may not correspond to a naturally occurring polypeptide. Each protein sequence is
preferably associated with a nucleic acid sequence (e.g., a gene encoding the protein).
A preferred embodiment utilizes homologous amino acid sequences. Another preferred
embodiment utilizes non-homologous amino acid sequences. Preferably, the parent
sequence is also the sequence for a protein that has some level or degree of activity or
function (e.g., catalytic activity, binding affinity, solubility, thermal stability, etc.) to be
optimized. The methods of the invention may then be used, e.g., to optimize the activity
or function of the parent sequence and/or to optimize the activity in altered conditions.
For example, in one embodiment the parent sequence may be a protein having a
particular catalytic or other activity, and the methods of the invention may be used to
identify sequences having the same activity but under different (generally more extreme)
conditions such as conditions of temperature or of solvent (including, for example,
solvent polarity, salt conditions, acidity, alkalinity, etc.). In another embodiment, the
parent sequence may have a particular level or amount of activity (e.g., catalytic activity,
binding affinity, etc.), and the directed evolution methods of the invention may be used
to identify sequences having improved levels or amounts of that same activity (e.g.,
higher binding affinity or increased catalytic rate).

Align Polymer Sequences.

Once the parental sequences are selected, the sequences are aligned (FIG. 1A).
The invention contemplates alignment of parental sequences in either nucleic acid or
amino acid forms. In a preferred embodiment, homologous (evolutionarily related)
amino acid parental sequences are aligned based upon sequence identity, sequence
similarity, or a combination of both parameters. The various parameters associated with
alignment of amino acid sequences are well known in the art. In another preferred
embodiment, the parental sequences are aligned as nucleic acid sequences. In a preferred
embodiment, the nucleic acid sequences are aligned based upon regions of sequence
identity.

Alignment of parental sequences can be accomplished visually or with the use of
an algorithm. The invention encompasses the use of, but is not limited to, the following
alignment programs: GAP, BLAST, FASTA, DNA Strider, CLUSTAL, and GCG. The
invention includes the use of default parameters and standard parameters of the computer
programs. It preferably includes the use of alignment parameters routinely employed in
the art. A preferred embodiment of the invention utilizes the BLAST amino acid alignment
program to align homologous sequences. Each parent sequence is aligned with the
structure sequence using a BLAST algorithm for comparing two sequences (Tatusova,
T. A. & Madden T. L., FEMS Microbiol. Lett. 174:247-250 (1999)). The BLOSUM62
matrix is used to score similar amino acids and the open gap and extension gap penalties
are 11 and 1, respectively.

Determination of Possible Crossover Locations Based on Hybridization 
The invention encompasses a computational "in silico" simulation of in vitro and
in vivo recombination. The types of in vitro recombination that are simulated include,
but are not limited to, various forms of recombination methods such as DNA shuffling,
StEP, random-priming recombination, and methods using DNase or restriction enzymes.

Crossover locations for recombination can be determined based on hybridization
between parents. When parental sequences contain areas of sequence identity, aligned
sequences can be examined for areas of identity based upon a predetermined subset or
number of sequential identical amino acids or nucleotides in two aligned parental
sequences (FIG. 1A). A preferred number of sequential identical amino acids is related to
the required length of the DNA for hybridization to occur in a particular recombination
experiment. A preferred embodiment is to search for regions of four identical amino
acids, or six identical nucleotides, shared by the parents. After identification of the areas
that meet the selected parameters of sequence identity, a cut point in the identified area
of sequence identity on the parental sequence is selected as a crossover location. The
placement of the cut point within a crossover region is not critical. As one example, the
cut point may be selected at any location within the identified region of sequence identity.
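
The search for regions of sequence identity can be sketched in Python as follows. The function names and the two aligned sequences are hypothetical, gaps are assumed to be written as "-", and the choice of the midpoint of each region as its cut point is an arbitrary assumption, since the placement within a region is not critical.

```python
def identity_regions(seq_a, seq_b, min_len=4):
    """Return (start, end) ranges (0-based, end exclusive) of runs of at least
    min_len consecutive identical, non-gap positions in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    regions, start = [], None
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        match = a == b and a != "-"
        if match and start is None:
            start = pos                      # a run of identity begins
        elif not match and start is not None:
            if pos - start >= min_len:       # keep the run only if long enough
                regions.append((start, pos))
            start = None
    if start is not None and len(seq_a) - start >= min_len:
        regions.append((start, len(seq_a)))  # run extends to the end
    return regions

def candidate_cut_points(regions):
    """Pick one cut point per region; here simply the midpoint (an assumption)."""
    return [(start + end) // 2 for start, end in regions]

# Made-up aligned parent sequences (illustration only)
PARENT_A = "MKTAYIAKQRQISFVK"
PARENT_B = "MKTAHIAKQRLISFVK"
REGIONS = identity_regions(PARENT_A, PARENT_B, min_len=4)
CUTS = candidate_cut_points(REGIONS)
```

Setting `min_len=6` on aligned nucleotide sequences would correspond to the six-nucleotide criterion mentioned in the text.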

In one particular embodiment of the invention, a computational algorithm was
utilized to mimic DNA shuffling recombination. Starting with the aligned parental DNA
sequences and their respective possible crossover locations (i.e., all possible cut points),
a randomly selected parental DNA sequence served as the initial template and was
copied to mutant offspring. When an identified candidate crossover location was reached
in the copying process, the parental template was switched to a randomly selected
different parental template under specified conditions. In a preferred embodiment, the
specified conditions were set as follows: (1) a randomly chosen number between 0 and
1 was less than a threshold of P_c (e.g. 0.03) and (2) a minimum of eight amino acids
lay between identified crossover locations where crossovers actually occurred. The value
P_c reflects the average number of fragments that each parent gene is cut into. For
example, in DNA shuffling experiments, this parameter is related to the time that the
parent template DNA is exposed to the DNA-cleaving enzyme DNase. The expression
to obtain P_c from the average number of fragments f is P_c = (f-1)/N, where N is the size
of the gene. The value 0.03 was set to model the fragment size reported by Stemmer,
supra, for the beta-lactamase shuffling experiment.
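
The template-switching procedure described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm actually used: the function names are hypothetical, the parents are assumed to be aligned and of equal length, positions are treated as residue positions, and the minimum spacing is interpreted as at least `min_spacing` positions between successive crossovers.

```python
import random

def crossover_probability(avg_fragments, gene_length):
    """P_c = (f - 1) / N, the expression given in the text."""
    return (avg_fragments - 1) / gene_length

def shuffle_once(parents, cut_points, p_c=0.03, min_spacing=8, rng=random):
    """Copy one offspring from the parents, switching template at candidate
    cut points with probability p_c.

    Returns (offspring, origin), where origin[i] is the index of the parent
    that donated position i.
    """
    if len(parents) < 2:
        raise ValueError("at least two parents are required")
    cuts = set(cut_points)
    current = rng.randrange(len(parents))    # randomly selected initial template
    last_crossover = None
    offspring, origin = [], []
    for pos in range(len(parents[0])):
        far_enough = (last_crossover is None
                      or pos - last_crossover >= min_spacing)
        if pos in cuts and far_enough and rng.random() < p_c:
            # Switch to a randomly selected different parental template.
            current = rng.choice([k for k in range(len(parents)) if k != current])
            last_crossover = pos
        offspring.append(parents[current][pos])
        origin.append(current)
    return "".join(offspring), origin
```

Under the expression above, a gene of 300 positions cut on average into 10 fragments gives P_c = 9/300 = 0.03. Passing a seeded `random.Random` instance as `rng` makes a simulated library reproducible.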

Determining Crossover Disruption.

The computational method of the invention predicts locations on parental
sequences where recombination should be most successful due to minimal disruption of
tertiary amino acid interactions in a crossover mutant. A crossover disruption E_c for each
mutant is determined. In one embodiment of the invention, coupling interactions are
considered disrupted if one amino acid of an interacting pair is replaced with
an amino acid from a different parent sequence in the hybrid mutant protein. The
crossover disruption for a particular mutant is determined by the summation of all
coupled interactions that are considered disrupted.
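
The summation described above can be sketched in Python for a pool of hybrid mutants. The function name, the 6 x 6 coupling matrix, and the parent-of-origin strings are hypothetical and serve only to illustrate the counting rule of this embodiment.

```python
def crossover_disruption(origin, matrix):
    """E_c for one mutant: number of coupled pairs whose two residues were
    inherited from different parents.

    origin[i] names the parent that donated residue i (0-based);
    matrix is a symmetric binary coupling matrix.
    """
    n = len(origin)
    return sum(matrix[i][j]
               for i in range(n)
               for j in range(i + 1, n)      # count each pair once
               if origin[i] != origin[j])

# Hypothetical 6-residue coupling matrix: residues 1-5 and 2-4 are coupled
MATRIX = [[0] * 6 for _ in range(6)]
for i, j in [(0, 4), (1, 3)]:
    MATRIX[i][j] = MATRIX[j][i] = 1

# Parent of origin at each position for three made-up hybrid mutants
POOL = {"m1": "AAABBB", "m2": "AAAAAB", "m3": "ABBBAA"}
E_C = {name: crossover_disruption(origin, MATRIX)
       for name, origin in POOL.items()}
```

Here the single crossover in "m1" separates both coupled pairs, whereas "m2" and "m3" keep each pair within one parent.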

Selection of a crossover mutant with minimal crossover disruption.
Once the crossover disruption for a pool of mutant biopolymers is determined, a
threshold is applied to screen the mutant biopolymers for those mutants that exhibit
minimal amounts of crossover disruption. Non-limiting examples of selection parameters
include the following: (1) application of a threshold, (2) selection of the 10% of the
mutant pool that exhibited the least amount of crossover disruption, (3) selection of the 10
mutants that exhibited the least amount of disruption, (4) selection of crossover mutants
exhibiting a crossover disruption below an average value, (5) selection of crossover
mutants exhibiting crossover disruption below a first standard deviation or more. In a
preferred embodiment of the method, a threshold is applied such that 1% of the total
mutant pool is allowed by the threshold. In another embodiment of the invention, a more
stringent threshold is utilized, whereby only 0.001% of the pool is allowed by the
threshold.
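Several of the selection parameters listed above can be sketched as simple filters over a pool of computed disruptions. The sketch assumes the pool is a dictionary mapping a mutant identifier to its crossover disruption; the function names are illustrative, and reading parameter (5) as "below the mean by a number of standard deviations" is an interpretation, not a statement of the method.

```python
import statistics

def select_lowest_fraction(disruptions, fraction):
    """Keep the given fraction of mutants with the least disruption (at least one)."""
    ranked = sorted(disruptions, key=disruptions.get)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

def select_lowest_n(disruptions, n):
    """Keep the n mutants with the least disruption."""
    ranked = sorted(disruptions, key=disruptions.get)
    return ranked[:n]

def select_below_mean(disruptions, n_std=0.0):
    """Keep mutants whose disruption is below mean - n_std * standard deviation."""
    values = list(disruptions.values())
    cutoff = statistics.mean(values) - n_std * statistics.pstdev(values)
    return [m for m, e in disruptions.items() if e < cutoff]
```

A 1% threshold as in the preferred embodiment would correspond to select_lowest_fraction(pool, 0.01).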

A variation of this method, as depicted in the flow chart of FIG. 1A, is shown
diagrammatically in FIG. 12.

Non-Sequence Identity Based Recombination

Recombination that is not dependent on sequence identity can also be modeled
according to the invention. This can be called "non-homologous" recombination.
Schema based on structural features of parent polymers are identified, such as
three-dimensional domains of a protein, and accordingly, it is not necessary to align
parent polymers in this approach.

Other recombination methods limit the number of fragments and the locations for
crossovers between the parents. For example, the ITCHY protocol limits recombination
to one crossover point. Other known protocols use restriction enzymes to cut at very
specific locations in the gene, based on a stretch of DNA sequence 3-5 nucleotides long.
If restriction enzymes are used to fragment the parents, then crossovers occur based on
the set of restriction enzymes chosen by the researcher. For example, if a restriction
enzyme is chosen that only cuts at ATGG, then crossovers can only occur where ATGG
appears in the parental DNA sequence.
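As an illustration of the ATGG example, the candidate crossover locations imposed by a restriction enzyme can be enumerated by scanning each parental sequence for the recognition site. The sketch below is a simplification: it reports the start of each recognition site rather than the exact cleavage position, and the function names and parent labels are assumptions.

```python
def find_cut_sites(dna, recognition_site):
    """Return every 0-based index where the recognition site starts in the sequence."""
    sites = []
    start = dna.find(recognition_site)
    while start != -1:
        sites.append(start)
        start = dna.find(recognition_site, start + 1)
    return sites

def cut_sites_by_parent(parents, recognition_site):
    """Map each (hypothetical) parent name to its list of candidate crossover sites."""
    return {name: find_cut_sites(seq, recognition_site) for name, seq in parents.items()}
```

For example, find_cut_sites("CCATGGTTATGGA", "ATGG") returns [2, 8], so crossovers in that parent would be limited to those two locations.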

Further, methods based on limiting the fragments to the recombination of exons
can be employed. (Exons are naturally occurring fragments of the gene that precede the
splicing step of transcription.) By restricting the fragments, the potential locations of
crossovers are restricted. The restrictions that result from these methods can be included
in the calculations and computation described here, for example by noting the potential
crossover points and either reconstructing possible chimeric mutants, as described infra,
or by noting the location of these crossover points with respect to the disruption of
schema.

The schema disruption calculation provides a guide for both the
restriction-enzyme-based and exon-based recombination methods. From a starting database of
exons or restriction enzymes, a subset can be chosen that generates crossover locations
that minimize the schema disruption. This subset has a higher likelihood of generating
chimeric mutants that are structurally stable, thus generating libraries where improvement
in the desired properties is more likely.
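One way to picture the choice of a subset is to score each enzyme by the schema disruption at the crossover locations it generates and to rank the enzymes accordingly. The sketch below is only an illustration: the per-position disruption profile is a hypothetical input, the enzyme names are placeholders, and scoring by the mean disruption is an assumed criterion rather than the criterion of the invention.

```python
def rank_enzymes(cut_sites_by_enzyme, disruption_profile):
    """Order enzymes by the mean schema disruption at their cut sites (lowest first).

    cut_sites_by_enzyme: dict of enzyme name -> list of crossover positions.
    disruption_profile: list giving a disruption score for each position
    (hypothetical values supplied by a separate schema calculation).
    """
    scores = {}
    for enzyme, sites in cut_sites_by_enzyme.items():
        if sites:  # an enzyme that never cuts offers no crossover locations
            scores[enzyme] = sum(disruption_profile[s] for s in sites) / len(sites)
    return sorted(scores.items(), key=lambda item: item[1])
```

The first entries of the returned list would then form the candidate subset.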

5.6 Directed Evolution Methods to Target Optimal Crossover Locations 

The methods described above are particularly useful for directed evolution
experiments, e.g., to obtain proteins, nucleic acids or other polymers having one or more
desirable properties. For example, the computational models and protein design
algorithms can be used with directed evolution techniques to target mutants or hybrids
within a subset of the total sequence space, and particularly within a sequence space
corresponding to higher fitness probabilities. Accordingly, the invention provides genetic
engineering methods, including methods of directed evolution, for obtaining polymers
that have one or more improved properties. The improved properties include any
property or combination of properties that can be detected by a user and include, for
example, properties of catalytic activity (for example, increased rates of catalysis),
properties of stability (for example, increased thermal stability) or properties of binding
affinity (for example, increased affinity for a particular ligand or increased affinity for a
substrate), to name a few. Preferably the desirable property is a property that can be
detected in a screening assay.

Mutagenesis and Recombination 
In general, directed evolution methods comprise selecting at least one polymer
sequence. The polymer sequence is preferably the sequence for a biopolymer (e.g., a
nucleic acid or a polypeptide) that has a particular property or properties of interest. For
example, the particular property of the parent may be a particular catalytic activity,
binding to a particular substrate or ligand, thermal stability or a combination thereof.
Preferably the property is one that can be readily determined or evaluated by a screening
assay, e.g., a high throughput screen. One or more residues of the parent polymer
sequence are then selected or targeted for mutation. In traditional methods for directed
evolution, selection is random. For example, all or a large fraction of the residues are
available and/or are selected, e.g., by error prone PCR or DNA shuffling. However, in
the directed evolution methods of the invention, specific residues in the parent sequence
are identified as candidate crossover locations. The crossover locations may be
identified, for example, according to the analytical methods described above.

One or more, and preferably a plurality of, mutant polymer sequences may then
be generated based on the parent sequence. In particular, the directed evolution methods
of the invention preferably generate a plurality of mutants which are identical to the
parent sequence except that one or more structurally tolerant residues are mutated.
Polymers having the mutant sequences may then be generated using polymer synthesis
and/or recombinant technologies well known in the art, and the polymers having these
mutant sequences are then preferably screened for the one or more properties of interest.
In particular, methods of directed evolution typically have, as their goal, the selection
and/or identification of polymers (in particular, modified polymers) wherein one or more
particular properties of interest are altered, and are preferably improved. For example,
a directed evolution method may have, as its goal, the selection of polymers that have
improved catalytic activity (e.g., a higher rate of catalysis), improved (e.g., stronger)
binding to a particular ligand or substrate, or greater thermal stability. Therefore, in
preferred embodiments one or more of the mutant polymers are selected where one or
more of the properties of interest are different from the parent sequence. Preferably, the
one or more properties of interest are improved in the selected polymer sequences.

In preferred embodiments, methods of directed evolution may be repeated to 
generate and identify polymers where one or more properties of interest progressively 
improve with each iteration. Accordingly, in a preferred embodiment, one or more of the 
selected polymers may be selected as a new parent sequence, for use in a next round of 
iteration in the directed evolution method. Crossover locations in the new parent 
sequence may then be identified and selected, and a second generation of mutants can be 
generated and screened as described above. Improved mutants may also be recombined 
if desired, using conventional genetic engineering techniques, to obtain further variations 
and improvements. These processes may be repeated as desired, to obtain successive 
generations of mutants. 

Polymer Evolution Techniques 
Methods for the directed evolution of polymers such as nucleic acids and
polypeptides are well known in the art. See, for example, Dube et al., Gene, 137:41
(1993); Moore & Arnold, Nature Biotechnology, 14:458 (1996); Joo et al., Nature,
399:670 (1999); Zhao & Arnold, Protein Engineering, 12:47 (1999); Skandalis et al.,
Chem. Biol., 4:m-m (1997); K\ko\ovz et al., Proc. Natl. Acad. Sci. U.S.A., 95:14675
(1998); Miyazaki & Arnold, J. Molecular Evolution, 49:716 (1999). See, also, U.S.
Patent Nos. 5,741,691 and 5,811,238; International Patent Applications WO 98/42832,
WO 95/22625, WO 97/20078, WO 95/41653, and U.S. Patent Nos. 5,605,793 and
5,830,721. Generally, such methods work by selecting a parent sequence, typically a
particular protein, and generating large numbers of mutants, for example by error prone
PCR of a gene encoding the selected protein. The mutants are then tested, preferably in
a screening assay, to identify mutants that actually have an improved property detected
in the assay (for example, increased catalytic activity, or stronger binding to a ligand or
substrate). These mutants are selected and again mutated, and the second generation of
mutants is again tested to identify new mutants where the property is further improved.
Thus, traditional directed evolution methods randomly search through the sequence space
of a polymer one residue at a time to identify mutants with an increased fitness.
Such traditional methods are limited, however, by the finite capacity of existing
assays to screen mutants. Existing screening assays may observe and/or select from
between about 10* or 10^12 mutants, depending on the particular method. However, for
a typical protein of 300 amino acid residues the number of possible amino acid
combinations is about 10^390. Thus, screening assays can only observe a small fraction of
sequences in the sequence space of a given parent.
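The size of the sequence space quoted above follows from 20 possible amino acids at each of 300 positions, i.e. 20^300, which is about 10^390. The short calculation below is a sketch of that arithmetic; the variable names are illustrative, and 10^12 is taken as the upper screening capacity mentioned in the text.

```python
import math

def log10_sequence_space(length, alphabet_size=20):
    """Base-10 logarithm of alphabet_size ** length."""
    return length * math.log10(alphabet_size)

exponent = log10_sequence_space(300)           # log10 of 20**300
screen_capacity_log10 = 12                     # upper screening capacity, 10^12
fraction_log10 = screen_capacity_log10 - exponent

print(f"sequence space ~ 10^{exponent:.0f}")                   # 10^390
print(f"largest screenable fraction ~ 10^{fraction_log10:.0f}")  # 10^-378
```

Even at the upper screening capacity, the fraction of the sequence space that can be observed is therefore on the order of 10^-378.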

Using the analytical methods described above, a user can improve upon such
existing methods by identifying locations on polymers that allow crossovers to occur
while maintaining their function and specifically selecting those locations for mutation
in the iterative step of a directed evolution experiment. In preferred embodiments a user
may identify and target residues that have crossover locations that exhibit crossover
disruption below a certain value in in vitro experiments.

The invention encompasses, but is not limited to, the following examples of in
vitro techniques: (1) fragmentation and reassembly techniques (e.g., the Stemmer DNA
shuffling method, Stemmer, Nature 1994, 370:389); (2) staggered extension process
(StEP) (Zhao et al., Nature Biotechnology 1997, 49:290); (3) synthesis techniques; and (4)
PCR based targeting. It will be understood by practitioners that these and other methods
can be used in the invention, and that these methods may be applied to any number of
parents and cut points. The recombination techniques of the invention include in vitro
and in vivo recombination, as well as methods which combine both approaches, and
further, recombinants can be cloned and/or expressed by host cells according to known
techniques.

Fragmentation and reassembly techniques utilize a restriction enzyme or set of
restriction enzymes at specific concentrations to selectively cut biopolymer strands at
identified locations. The choice and concentration of enzyme(s) are determined based
upon the identified optimal crossover locations determined by the method of the
invention. The method can be applied to homologous and non-homologous nucleic acid
sequences. The resulting DNA fragments, produced by the restriction enzyme digest, can
be reassembled by techniques known in the art, thereby creating hybrid parental DNA
strands that can be used as templates for the production of proteins. The invention also
encompasses the fragmentation and reassembly of amino acid sequences. The
fragmentation and reassembly may be accomplished, for example, by the use of chemical
methods or enzymes for homologous or non-homologous amino acid sequences.

Alternatively, the StEP method (Zhao et al., Nature Biotechnology 1997, 49:290)
biases the creation of mutant hybrid proteins towards mutations at desired crossover
locations. A set of DNA primers is synthesized to hybridize with equal probability to
all parental strands at desired crossover recombination locations. The desired hybrid
DNA sequence can be created by chemically synthesizing the desired DNA sequence or
ligating synthesized fragments of the desired DNA sequence. One method is to
synthesize fragments based upon optimal crossover locations from all the parents and
randomly anneal the fragments to produce a recombinant library. A related method
reduces the need to synthesize each full length parental gene by encompassing the use of
overlap extension, a DNA polymerase, and partial synthesis of the genes of interest to
create the full length gene of interest.

StEP Recombination. This approach is illustrated in FIG. 7 for two crossovers
and two parental genes. Split pool synthesis can be used to minimize the synthesis
burden. The method of Volkov et al., Nucl. Acids Res., 27:18 (1999) may be used. As
shown in FIG. 7, a "grey" parent and a "black" parent are each cut at positions 1 and 2.
Crossover recombination at these cut points or crossover regions generates eight possible
recombinants, including two that are identical to one of the parents. The remaining six
recombinants have mutant sequences with contributions from each parent that cross over
to a contribution from the other parent at one or both cut points. See, FIG. 7, part (A).
Each of these recombinants can be made by assembly of synthetic fragments that contain
the cut points or crossover locations, i.e., at least one of each pair of fragments to be
joined contains residues from one or the other parent that extend past the cut point, as
shown in FIG. 7, part (B). In this example, the terminal fragments have end primers that
include a cut point, resulting in four possible fragments on the left, four on the right, and
two (one from each parent) in the middle. These fragments can be reassembled in eight
different sets of three, to produce each of the eight recombinants in FIG. 7, part (A).
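The count of eight recombinants follows from choosing one of two parents for each of the three segments defined by two cut points (2 x 2 x 2 = 8). A minimal sketch of that enumeration is given below; the parent labels follow the "grey" and "black" naming used for FIG. 7, and the function name is an assumption.

```python
from itertools import product

def enumerate_recombinants(parents=("grey", "black"), n_cut_points=2):
    """List every ordered choice of parent for the segments between cut points."""
    n_segments = n_cut_points + 1
    return list(product(parents, repeat=n_segments))

recombinants = enumerate_recombinants()
# Recombinants built from a single parent are identical to that parent.
parental = [r for r in recombinants if len(set(r)) == 1]
# All others cross over at one or both cut points.
crossovers = [r for r in recombinants if len(set(r)) > 1]
```

With the defaults, the list contains eight entries, of which two are parental and six are crossover recombinants. The same function extends to more parents or cut points.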
In Vitro-In Vivo Recombination. A hybrid in vitro-in vivo recombination method
is outlined in FIG. 8. In FIG. 8, the method pertains to the shuffling of two parental
genes. The method encompasses gene assembly using synthetic fragments and overlap
extension with fragments followed by gap repair, which creates double stranded
sequences containing mismatched regions. The mismatches are then repaired randomly
in vivo when inserted into an appropriate host cell in the form of a heteroduplex plasmid.
This method removes parental homoduplexes and results in a library of random
crossovers near the mismatched sites for each of the two reactions. Further complexity
(more crossovers) can be added easily by adding fragments corresponding to desired
crossover points.

In FIG. 8, a "grey" parent and a "black" parent represent polymers (e.g., genes)
to be cut and reassembled at two cut points. Synthetic fragments from each parent are
extended at a cut point to correspond with the sequence of the other parent, by using the
other parent as a template. For example, fragments derived from the black parent are
extended at designated cut points with sequences from the grey parent, using the grey
parent as a template. Fragments derived from the grey parent are likewise extended using
the black parent as a template. This produces polymer duplexes, e.g., double strands of
nucleic acid residues, representing the different possible combinations of fragments.

In the example of FIG. 8, with two cut points, two sets of four different duplexes
are possible, for a total of eight duplexes. These represent the eight possible
recombinations of sequences from the two parents by crossovers at the two cut points.
Two of these duplexes are homoduplexes, meaning that the sequences of both polymers
are identical to each other. They are also each identical to one of the parent polymers.
The remaining six duplexes are heteroduplexes, meaning that the sequences of each
polymer in the duplex pair are different. One member of each heteroduplex has a
sequence identical to one of the parents. The other member of each heteroduplex pair is
a crossover recombinant, with a sequence that crosses over from one parent to the other
at one or more of the cut points. In this example, with two cut points, a crossover can
occur at one or both cut points, resulting in two sets of three recombinant sequences that
differ from parent sequences. As shown in FIG. 8, these six crossover recombinants are
(black-grey-black), (grey-grey-black), (black-grey-grey), and the "reverse" set of
recombinants (grey-black-grey), (black-black-grey), and (grey-black-black).

The duplexes produced by this method can be introduced to an appropriate host
cell for heteroduplex recombination, which serves to remove the parent homoduplexes.
The result is a library of crossover recombinants having sequences contributed by both
parents.

It will be understood that this discussion and FIG. 8 are an illustration of a general
technique that is applicable to the invention. For example, more than two parents and/or
more than two cut points can be used.

PCR Amplification. Another method is outlined in FIG. 9. Gene fragments for
reassembly can be prepared by PCR with primers directed for crossovers. The primers
can be designed such that a single primer will hybridize equally to all parent strands at
the desired positions at crossover locations. The fragments prepared by these reactions
are pooled and reassembled by PCR with flanking primers, e.g., 1+6 in the example. The
resulting PCR products will have crossovers directed to locations of the primers.

As shown in FIG. 9, several sets of primers are made for each parent polymer.
One set of primers corresponds to the terminal ends of the polymer. In this example,
there is one primer for each of the 3' and 5' ends of a polynucleotide, designated 1 and 6
in FIG. 9. Each remaining set of primers corresponds to a cut point, and in this
example there are two primers for each cut point. These are designated 2 and 3 for the
cut point at the left, and 4 and 5 for the cut point at the right in FIG. 9. Similar sets of
primers are prepared for each other parent. PCR amplification is performed using pairs
of primers that flank adjacent regions of the polymer, e.g., primers 1 and 2, primers 3 and
4, and primers 5 and 6. All of the possible fragments from all of the parents are
reassembled in a pool, using PCR reactions starting with primers 1 and 6.
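The primer numbering of FIG. 9 generalizes in a simple way: with k cut points there are 2k+2 primers, paired so that each pair flanks one region, and the first and last primers are used for reassembly. The sketch below illustrates this bookkeeping only; the function names are assumptions, and the count of assemblies assumes one fragment per parent for each region.

```python
def primer_pairs(n_cut_points):
    """Number primers 1..2k+2 and pair those that flank each adjacent region."""
    last = 2 * n_cut_points + 2
    pairs = [(p, p + 1) for p in range(1, last, 2)]
    flanking = (1, last)  # primers used to reassemble the pooled fragments
    return pairs, flanking

def count_assemblies(n_parents, n_cut_points):
    """Number of ways to pick one parent's fragment for each region."""
    return n_parents ** (n_cut_points + 1)
```

For the two cut points of FIG. 9, primer_pairs(2) gives the pairs (1, 2), (3, 4) and (5, 6), with flanking primers 1 and 6.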

Family Shuffling. Another method is outlined in FIG. 10, which is a DNA
shuffling method as described, e.g., by the 1994 Stemmer references. The recombination
is directed to specific sites utilizing "crossover" primers. The crossover primers are
synthesized to contain crossover sequences and are used during the reassembly reaction.
The concentration of the primer can be varied and can be much higher than that of the
parental genes.

In this approach, sets of primer pairs are prepared. Each primer of each pair has
sequences from two parents which span and include a designated crossover location.
FIG. 10, part (A). The parent genes are fragmented, and fragments are reassembled in
the presence of the primers using PCR amplification. The primers promote reassembly
and amplification at the crossover locations they span, to produce complementary
recombinants with sequences from more than one parent. FIG. 10, part (B). Two parents
and two cut points are shown in this example, but more may be used. In the figure, a
partially reassembled sequence for one recombinant is shown, with terminal sequences
coming from one parent (black) and the middle or intervening sequences coming from
another parent (grey).
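A crossover primer of the kind described above can be pictured as a short sequence whose upstream portion is taken from one parent and whose downstream portion is taken from the other, joined at the designated crossover location. The sketch below is illustrative only: it assumes aligned parents written on the same strand, ignores strand orientation and reverse complements, and uses a hypothetical flank length.

```python
def crossover_primer(upstream_parent, downstream_parent, location, flank=10):
    """Join `flank` bases before the location from one parent to `flank`
    bases from the location onward in the other, spanning the crossover."""
    if location < flank or location + flank > len(downstream_parent):
        raise ValueError("crossover location too close to a sequence end")
    return (upstream_parent[location - flank:location]
            + downstream_parent[location:location + flank])

def crossover_primer_pair(parent_a, parent_b, location, flank=10):
    """One primer for each direction of crossover (a to b, and b to a)."""
    return (crossover_primer(parent_a, parent_b, location, flank),
            crossover_primer(parent_b, parent_a, location, flank))
```

In practice the flank length would be chosen to give adequate annealing to both parents; the default of 10 bases is a placeholder.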

The methods described above and illustrated by FIGS. 7-10 are novel methods
for targeting optimal crossover locations, in particular based on the techniques and
calculations described herein, e.g., in Sec. 5.4 above.


Screening Hybrids With Protected Schema 

According to the invention, crossovers at locations that minimally disrupt
coupling interactions with other residues are more likely to lead to functional proteins.
By focusing the crossovers in a directed evolution experiment to residues having
crossover locations that minimize the disruption of coupling interactions or domains, the
number of sequences that must be tested or screened is considerably reduced.

Referring specifically to embodiments where the parent sequence is a protein or
other polypeptide sequence, the parent sequence (and mutants thereof) may be expressed
in facile gene expression systems to obtain libraries of mutant proteins. Any source of
nucleic acid in purified form can be utilized as the starting nucleic acid. Thus, the
process may employ DNA or RNA, including messenger RNA. The DNA or RNA may
be either single or double stranded. In addition, DNA-RNA hybrids which contain one
strand of each may be utilized. The nucleic acid sequence may also be of various lengths
depending on the size of the sequence to be mutated. Preferably, the specific nucleic acid
sequence is from 50 to 50,000 base pairs. It is contemplated that entire vectors
containing the nucleic acid encoding the protein of interest may be used in these methods.

Once the evolved polynucleotide molecules are generated they can be cloned into
a suitable vector selected by the skilled artisan according to methods well known in the
art. If a mixed population of the specific nucleic acid sequence is cloned into a vector it
can be clonally amplified by inserting each vector into a host cell and allowing the host
cell to amplify the vector. The mixed population may be tested to identify the desired
recombinant nucleic acid fragment. The method of selection will depend on the DNA
fragment desired. For example, in this invention a DNA fragment which encodes for a
protein with improved properties can be determined by tests for functional activity and/or
stability of the protein. Such tests are well known in the art.

Using the methods of directed evolution, the invention provides a novel means
for producing functional and soluble proteins with improved activity toward one or more
substrates. The mutants can be expressed in conventional or facile expression systems
such as E. coli. Conventional tests can be used to determine whether a protein of interest
produced from an expression system has improved expression, folding and/or functional
properties. For example, to determine whether a polynucleotide subjected to directed
evolution and expressed in a foreign host cell produces a protein with improved activity,
one skilled in the art can perform experiments designed to test the functional activity of
the protein. Briefly, the evolved protein can be rapidly screened, and is readily isolated
and purified from the expression system or media if secreted. It can then be subjected to
assays designed to test functional activity of the particular protein.

A flow chart of an exemplary directed evolution algorithm is illustrated in FIG.
14. A library of mutants can be made by any of the methods described herein. The
library can be sorted or restricted using the computational methods of the invention to
identify the most promising subset of "fit" mutants. These can be screened to pick the
most fit mutant. This process can be repeated in successive generations, until no further
changes are observed, a set goal is achieved, or the process is ended at any desired step.
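The loop described above (make a library, restrict it computationally, screen, and repeat) can be sketched as follows. This is a schematic illustration and not the algorithm of FIG. 14 itself: the callables make_library, disruption and fitness are hypothetical stand-ins for library construction, the computed crossover disruption, and the experimental screen, and the default parameter values are placeholders.

```python
def directed_evolution(parent, make_library, disruption, fitness,
                       keep_fraction=0.01, max_generations=10, goal=None):
    """Iterate: build a library, restrict it by computed disruption, screen, repeat."""
    best, best_fitness = parent, fitness(parent)
    for _ in range(max_generations):
        library = make_library(best)
        # Computational restriction: keep the least-disrupted subset of the library.
        ranked = sorted(library, key=disruption)
        subset = ranked[:max(1, int(len(ranked) * keep_fraction))]
        # Screen (here a stand-in callable) to pick the most fit mutant.
        candidate = max(subset, key=fitness)
        if fitness(candidate) <= best_fitness:
            break  # no further improvement observed
        best, best_fitness = candidate, fitness(candidate)
        if goal is not None and best_fitness >= goal:
            break  # set goal achieved
    return best, best_fitness
```

The three stopping conditions in the sketch correspond to those named in the text: no further change, a goal being reached, or a fixed number of generations.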

5.8 Implementation Systems and Methods 
Computer System.
The analytical methods described in the previous subsections may preferably be
implemented by the use of one or more computer systems, such as those described herein.
Accordingly, FIG. 11 schematically illustrates an exemplary computer system suitable
for implementation of the analytical methods of this invention. Computer 201 is
illustrated here as comprising internal components linked to external components.






WO 03/055978 



PCT/US02/34374 



However, a skilled artisan will readily appreciate that one or more of the components
described herein as "internal" may, in alternative embodiments, be external. Likewise,
one or more of the "external" components described here may also be internal. The
internal components of this computer system include processor element 202
interconnected with a main memory 203. For example, in one preferred embodiment
computer system 201 may be a Silicon Graphics R10000 Processor running at 195 MHz
or greater and with 2 gigabytes or more of physical memory. In another, less preferable,
exemplary embodiment, computer system 201 may be an Intel Pentium based processor
of 150 MHz or greater clock rate and with 32 megabytes or more of main memory.

The external components may include a mass storage 204. This mass storage may
be one or more hard disks which are typically packaged together with the processor and
memory. Such hard disks are typically of at least 1 gigabyte storage capacity, and more
preferably have at least 5 gigabytes or at least 10 gigabytes of storage capacity. The mass
storage may also comprise, for example, a removable medium such as a CD-ROM drive,
a DVD drive, a floppy disk drive (including a Zip™ drive), or a DAT drive or other
Other external components include a user interface device 205, which can be, for
example, a monitor and a keyboard. In preferred embodiments the user interface is also
coupled with a pointing device 206 which may be, for example, a "mouse" or other
graphical input device (not illustrated). Typically, computer system 201 is also linked
to a network link 207, which can be part of an Ethernet or other link to one or more other,
local computer systems (e.g., as part of a local area network or LAN), or the network link
may be a link to a wide area communication network (WAN) such as the Internet. This
network link allows computer system 201 to communicate with one or more other
computer systems.

Typically, one or more software components are loaded into main memory 203
during operation of computer system 201. These software components may include both
components that are standard in the art and components that are special to the invention,
and the components collectively cause the computer system to function according to the
analytical methods of the invention. Typically, the software components are stored on
mass storage 204 (e.g., on a hard drive or on removable storage media such as one or more
CD-ROMs, RW-CDs, DVDs, floppy disks or DATs). Software component 210 represents an
operating system, which is responsible for managing computer system 201 and its
network interconnections. This operating system is typically one routinely used
in the art and may be, for example, a UNIX operating system or, less preferably, a
member of the Microsoft Windows™ family of operating systems (for example,
Windows 2000, Windows Me, Windows 98, Windows 95 or Windows NT) or a
Macintosh operating system.

Software component 211 represents common languages and functions
conveniently present in the system to assist programs implementing the methods specific
to the invention. Languages that may be used include, for example, FORTRAN, C, C++
and less preferably JAVA.

The analytical methods of the invention may also be programmed in mathematical
software packages which allow symbolic entry of equations and high-level specification
of processing, including algorithms to be used, thereby freeing a user of the need to
procedurally program individual equations and algorithms. Examples of such packages
include Matlab from Mathworks (Natick, Massachusetts), Mathematica from Wolfram
Research (Champaign, Illinois) and S-Plus from MathSoft (Seattle, Washington).
Accordingly, software component 212 represents the analytic methods of the invention
as programmed in a procedural language or symbolic package.

The memory 203 may, optionally, further comprise software components 213
which cause the processor to calculate or determine a three-dimensional structure for a
macromolecule and, in particular, for a given polymer sequence such as a protein or
nucleic acid sequence. Such programs are well known in the art, and numerous software
packages are available. This software includes Swiss-Pdb Viewer (Glaxo Wellcome
Experimental Research); Biograf (Molecular Simulations, Inc); O (generally used for
crystallography); Explorer (MSI); Quanta; CHARMM; and Sybyl (Tripos). The memory
may also comprise one or more other software components, such as one or more other
files representing, e.g., one or more sequences of polymer residues including, for
example, a parent sequence and/or other sequences (for example, mutant sequences). The
memory 203 may also comprise one or more files representing the three-dimensional
structures of one or more sequences, including a file representing the three-dimensional
structure of a parent sequence, such as a parent protein or nucleic acid.

Computer Program Products.
The invention also provides computer program products which can be used, e.g.,
to program or configure a computer system for implementation of analytical methods of
the invention. A computer program product of the invention comprises a computer
readable medium such as one or more compact disks (i.e., one or more "CDs", which may
be CD-ROMs or RW-CDs), one or more DVDs, one or more floppy disks (including,
for example, one or more ZIP™ disks) or one or more DATs, to name a few. The
computer readable medium has encoded thereon, in computer readable form, one or more
of the software components 212 (FIG. 11) that, when loaded into memory 203 of a
computer system 201, cause the computer system to implement analytic methods of the
invention. The computer readable medium may also have other software components
encoded thereon in computer readable form. Such other software components may
include, for example, functional languages 211 or an operating system 210. The other
software components may also include one or more files or databases including, for
example, files or databases representing one or more polymer sequences (e.g., protein or
nucleic acid sequences) and/or files or databases representing one or more
three-dimensional structures for particular polymer sequences (e.g., three-dimensional
structures for proteins and nucleic acids).

System Implementation. 

In an exemplary implementation, to practice the methods of the invention a
parent sequence may first be loaded into the computer system 201 (FIG. 11). For
example, the parent sequence may be entered directly by a user from the monitor and
keyboard 205, by typing a sequence of code or symbols representing different
residues (e.g., different amino acid or nucleotide residues). Alternatively, a user may
specify parent sequences, e.g., by selecting a sequence from a menu of candidate
sequences presented on the monitor or by entering an accession number for a sequence
in a database (for example, the GenBank or SWISS-PROT database), and the computer
system may access the selected parent sequence from the database, e.g., by accessing a
database in memory 203 or by accessing the sequence from a database over the network
connection, e.g., over the internet.

The programs may then cause the computer system to obtain a three-dimensional
structure of the parent sequence. For example, the three-dimensional structure for the
parent sequence may also be accessed from a file (for example, a database of structures)
in the memory 203 or mass storage 204. Alternatively, the three-dimensional structure
may be retrieved through the computer network 207 from a
database of structures such as the PDB database. In yet other embodiments, the software
components may, themselves, calculate a three-dimensional structure using the molecular
modeling software components. Such software components may calculate or determine
a three-dimensional structure, e.g., ab initio, or may use empirical or experimental data
such as X-ray crystallography or NMR data that may also be entered by a user or loaded
into the memory 203 (e.g., from one or more files on the mass storage 204 or over the
computer network 207). The software components may further cause the computer
system to calculate a conformational energy for the parent sequence using the
three-dimensional structure.

Finally, the software components of the computer system, when loaded into
memory 203, preferably also cause the computer system to determine a coupling matrix
or, in the alternative, a parameter related to or correlating with coupling interactions
according to the methods described herein. For example, the software components may
cause the computer system to generate one or more mutant sequences of the parent and,
using the conformation determined or obtained for the parent sequence, determine
coupling interactions as well as disrupted coupling interactions.

Upon implementing these analytic methods, the computer system preferably then
outputs, e.g., the coupling constants of the parent sequence or the disruption profile of the
mutant pool. For instance, the coupling interactions may be output to the monitor,
printed on a printer (not shown) and/or written on mass storage 204. In preferred
embodiments, the software components may also cause the computer system to select and
identify one or more particular crossover locations in the parent sequence for mutation,
e.g., in a directed evolution experiment. For example, the computer system may identify
residues of the parent sequence as crossover locations that minimally disrupt
coupling interactions. These residues could be identified, for a user, as ones which, if
mutated, are most likely to improve properties of the polymer in a directed evolution
experiment while retaining function.

Alternative systems and methods for implementing the analytic methods of this
invention are also intended to be comprehended within the accompanying claims. In
particular, the accompanying claims are intended to include the alternative program
structures for implementing the methods of this invention that will be readily apparent
to those skilled in the relevant art(s).

6. EXAMPLES 

The present invention is also described by means of particular examples.
However, the use of such examples anywhere in the specification is illustrative only and
in no way limits the scope and meaning of the invention or of any exemplified term.
Likewise, the invention is not limited to any particular preferred embodiments described
herein. Indeed, many modifications and variations of the invention will be apparent to
those skilled in the art upon reading this specification and can be made without departing
from its spirit and scope. The invention is therefore to be limited only by the terms of the
appended claims along with the full scope of equivalents to which the claims are entitled.

6.1 Computational Determination of Structural Schema 

Structural schema of a biopolymer, e.g., a gene or protein, can be identified, and
crossover disruption profiles of identified schema can be calculated. These calculations
can be used to predict optimal crossover locations and resulting recombinant offspring
that are more likely to be stable and to exhibit new or improved properties. Schema
disruption profiles can be based on energy or distance calculations, or both. A preferred
method, for its relative computational efficiency, is based on interatomic distances.

Crossover Disruption Based on Interatomic Distances

Computing the distances between atoms, rather than performing a detailed energy calculation,
can significantly accelerate the calculation of coupling interactions between residues. To
perform this calculation, a structure file (such as a Protein Data Bank PDB or Biograf BGF
file) is read that contains the coordinates for each atom of the structure. The distances
between all atoms are calculated with the equation

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \qquad (1)$$

where $d_{ij}$ is the distance between atoms i and j, and $(x_i, y_i, z_i)$ are the three-dimensional
coordinates of atom i. Two residues are considered coupled if any of their atoms (both
side chain and main chain, excluding hydrogens) are within a cutoff distance $d_c$. The
parameter $d_c$ is set such that the average number of coupling interactions per residue is
between 4 and 12. The preferred value for $d_c$ is 4.0 angstroms, corresponding to
approximately 7-8 interactions per residue. A two-dimensional coupling matrix c is
used to keep track of the coupled residues. An element of this matrix, $c_{ij}$, is equal to one
if residues i and j are within distance $d_c$ and is zero otherwise.

Despite the beneficial reduction in computational complexity, the disruption results
based on distance rather than energy calculations are not significantly altered. FIG. 15
compares the calculation of the single-cut-point crossover disruption for transformylase
based on the distance (top) and energy (bottom) definitions of coupling. The qualitative
shape of both plots is similar, and a quantitative comparison of both measures yields R²
= 0.91. FIG. 16(A) shows a plot based on energy, and FIG. 16(B) shows a plot based
on distance. Due to the significant improvement in calculation time, the distance-based
definition of coupling is a preferred mode for the disruption calculations.
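The distance-based coupling definition can be sketched as follows. This is a minimal illustration, assuming the heavy-atom coordinates have already been read from the structure file and grouped by residue; the function names and the toy coordinates are hypothetical and do not come from the specification.

```python
import math
from itertools import combinations


def coupling_matrix(residues, d_c=4.0):
    """Build the coupling matrix c from per-residue heavy-atom coordinates.

    residues: one list of (x, y, z) tuples per residue (hydrogens excluded).
    c[i][j] is 1 if any atom of residue i lies within d_c of any atom of
    residue j, and 0 otherwise.
    """
    n = len(residues)
    c = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if any(math.dist(a, b) <= d_c for a in residues[i] for b in residues[j]):
            c[i][j] = c[j][i] = 1
    return c


def mean_couplings_per_residue(c):
    """Average number of coupling interactions per residue."""
    return sum(map(sum, c)) / len(c)


# Toy input (assumed): three residues placed along the x axis.
toy = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
print(coupling_matrix(toy))  # -> [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

On a real structure, the cutoff could be adjusted until `mean_couplings_per_residue` falls in the range of 4 to 12 interactions per residue stated in the text.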



The crossover disruption of a fragment can then be calculated using the equation

$$E_c = \sum_{i \in \alpha} \sum_{j \notin \alpha} c_{ij} P_i P_j \qquad (2)$$

where the subscript $\notin \alpha$ indicates the summation over the residues not in the fragment α
and $\in \alpha$ indicates the summation over the residues in fragment α. $N_T$ is the total number of
residues, $N_\alpha$ is the number of residues in fragment α, $c_{ij}$ is the coupling matrix and $P_i$ is
the probability that two parents have different amino acid identities at residue i. The
probabilities $P_i$ and $P_j$ are determined by examining a sequence alignment of the parents
and counting the number of times that the parents share an amino acid identity at that
residue according to:

$$P_i = 1 - \frac{2}{n_p (n_p - 1)} \sum_{j=1}^{n_p} \sum_{k>j} s_{jk} \qquad (3)$$

where $s_{jk}$ = 1 if parents j and k have the same amino acid at residue i (and 0 otherwise), and $n_p$ is the total
number of parents. When the sequences of the parents are unknown, or if it is desirable
to count disruption at positions where the amino acid identities are identical, $P_i$ can be
set to unity for all i. Further, Equation (3) could be modified to reflect physico-chemical
similarities (such as charge, hydrophobicity, size) between amino acids, thus weighting
crossover disruption more heavily when comparing dissimilar amino acids.
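The crossover disruption of a fragment can be sketched as below. This assumes that the disruption sums $c_{ij} P_i P_j$ over residues i inside the fragment and residues j outside it, and that $P_i$ is one minus the fraction of parent pairs sharing an amino acid at position i. Both forms are one reading of the definitions above rather than a confirmed transcription, and all names and toy inputs are hypothetical.

```python
from itertools import combinations


def mismatch_probabilities(aligned_parents):
    """P_i: fraction of parent pairs that differ at aligned position i.

    Assumes equal-length, pre-aligned sequences; gap handling is omitted.
    """
    n_p = len(aligned_parents)
    n_pairs = n_p * (n_p - 1) // 2
    probs = []
    for i in range(len(aligned_parents[0])):
        same = sum(1 for a, b in combinations(aligned_parents, 2) if a[i] == b[i])
        probs.append(1.0 - same / n_pairs)
    return probs


def crossover_disruption(c, P, fragment):
    """Sum c[i][j] * P[i] * P[j] for i inside the fragment and j outside it."""
    inside = set(fragment)
    outside = [j for j in range(len(c)) if j not in inside]
    return sum(c[i][j] * P[i] * P[j] for i in inside for j in outside)


# Toy inputs (assumed): three aligned parents and a 4 x 4 coupling matrix.
parents = ["ACDE", "ACDF", "GCDE"]
c = [[0, 1, 0, 1],
     [1, 0, 0, 0],
     [0, 0, 0, 0],
     [1, 0, 0, 0]]
P = mismatch_probabilities(parents)
E_c = crossover_disruption(c, P, fragment=[0, 1])
```

Setting every entry of `P` to 1 reproduces the case described in the text where the parent sequences are unknown, so that every broken coupling counts fully.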

Disrupting Folding Domains 

A data set generated by the experimental engineered shuffling of a thermostable
phytase with a mesostable phytase yields further insight into the disruption caused by
domain substitution. Jermutus, et al., Structure-based chimeric enzymes as an alternative
to directed enzyme evolution: phytase as a test case, J. Biotech., 85: 15-24 (2001). In this
experiment, two chimeric proteins were created by extracting small domains (1, 2) from
the thermophilic (A. niger) phytase and inserting them into the less-stable (A. terreus)
phytase (FIG. 17(A)). One chimera (HyA) was created by inserting a surface helix
(residues 66-82) (FIG. 17(1)). The second chimera (HyB) was created by inserting a
buried beta-strand (residues 48-58) (FIG. 17(1)). HyA (2) was stabilized when compared
to the A. terreus wild-type, and HyB (1) was significantly destabilized.
This is shown by the comparison of melting temperatures ($T_m$) in degrees C for
HyA (mutant 2) and HyB (mutant 1) reported in the Experimental Data of FIG. 17. The
melting temperature of HyA (2) is higher than that of HyB (1), meaning it is relatively more
stable (more energy, i.e., a higher temperature, is needed to cause unfolding). Similarly,
the temperatures at which 50% of the HyA and HyB proteins are unfolded show that
HyA is more stable and HyB is less stable. FIG. 17 also shows comparison data for
thermodynamic properties of the wild type A. terreus phytase enzyme (wt) and the wild
type thermophilic A. niger phytase (wt-insert).



To determine the disruptiveness of the two domains, the crossover disruption was
calculated for each domain insertion and statistically compared to the disruptiveness of
all fragments in phytase. The crossover disruption $E_c$ of the HyA mutant is 8.12 and that of HyB
is 10.77 (FIG. 17, Calculations). While HyA is less disruptive, both compare well to the
average crossover disruption of 19.26 (standard deviation 4.09), calculated by determining
the disruptiveness of all possible fragments. To emphasize this trend, the Z-score was
calculated for each chimera, where the Z-score $Z_i$ of fragment i is defined as:

$$Z_i = \frac{E_{c,i} - \langle E_c \rangle}{\sigma(E_c)} \qquad (4)$$

where $E_{c,i}$ is the crossover disruption of fragment i, $\langle E_c \rangle$ is the average crossover
disruption of all fragments, and $\sigma(E_c)$ is the standard deviation of the crossover disruption
of all fragments. The Z-score of HyA is -2.72 and that of HyB is -2.08 (FIG. 17), indicating
that while HyA is predicted to be a more acceptable substitution than HyB, both have a
very low disruption when compared to the average.

Both chimeras have relatively low crossover disruption values because they are
both small fragments. Normalizing the crossover disruption measure by the number of
residues in the fragment $N_\alpha$ and the total number of residues $N_T$ overcomes this effect;
the normalized measure is given by:

(5)

where $E_c^*$ is the normalized disruption measure. Other possibilities include normalizing
the crossover disruption by the number of residues in the domain alone,

$$E_c^* = \frac{E_c}{N_\alpha} \qquad (6)$$

and normalizing the square-root of the crossover disruption by the number of residues in
the domain,

$$E_c^* = \frac{\sqrt{E_c}}{N_\alpha} \qquad (7)$$

When Equation 3 is used to calculate the disruptiveness of the substitutions into phytase,
HyA has a disruption of 0.005 and HyB has a disruption of 0.014 (FIG. 17). The average
disruption for all possible fragments is 0.006 (standard deviation is 0.002). The Z-score
of the HyA substitution is -0.86 and that of the HyB substitution is 4.838, indicating that by
these measures, the HyB substitution is far more likely to be destabilizing, as was found
experimentally (Jermutus et al., 2001). In general, Equation (6) is the preferred mode of
the calculation due to the lack of dependence on the total number of residues.
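The Z-score comparison described for the unnormalized crossover disruption can be checked with a short sketch. It assumes the Z-score is the fragment's disruption minus the mean over all fragments, divided by the standard deviation, which is the reading consistent with the reported values; the function name is hypothetical.

```python
def z_score(e_c, mean_e_c, std_e_c):
    """Z-score of one fragment's crossover disruption against all fragments."""
    return (e_c - mean_e_c) / std_e_c


# Values reported in the text for the unnormalized crossover disruption.
mean_all, std_all = 19.26, 4.09
print(round(z_score(8.12, mean_all, std_all), 2))   # HyA -> -2.72
print(round(z_score(10.77, mean_all, std_all), 2))  # HyB -> -2.08
```

The same function applies unchanged to any of the normalized measures, once the mean and standard deviation over all fragments are computed for that measure.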

The normalized value for crossover disruption (Equation 6) can be used to
determine the compatibility of isolated fragments when substituted into the remaining
structure. As an example, the crossover disruption was calculated for fragments that
appeared in the DNA shuffling experiment with beta-lactamase (Crameri et al., 1998). Each
fragment independently exhibits a low crossover disruption. When a list of possible
fragments is known before an experiment, this type of calculation could be used to
computationally separate a subgroup of fragments that are more likely to produce folded
chimeras, based on their disruptiveness of the structure. This approach could be applied
to methods of "exon shuffling," whereby parent genes are fragmented and recombined
at crossover points based on their natural intron-exon structure on the gene level.
Kolkman & Stemmer, Nature Biotechnology, 19: 423-428 (2000). The computational
method is able to determine the sets of exons that are least likely to be disruptive when
substituted into the structure.

Recombination can cause disruption on two levels of the hierarchical protein
folding process. First, the internal energy can be disturbed by the substitution of a
parental fragment that disturbs the interactions that stabilize the structure (the crossover
disruption). Second, if a fragment causes a highly concentrated region of crossover
disruption, then this region is unlikely to fold. Even if the remainder of the structure has
a low internal energy (few broken coupling interactions), the locally misfolded region
would be severely destabilizing. Combined, the phytase and beta-lactamase data sets
support this view of disruption. In both experiments, crossovers that distributed the
disruption throughout the gene, rather than creating localized regions of high crossover disruption,
generated stable chimeras. Practically, this implies that it is better to have a large
absolute crossover disruption (large total $E_c$) that is well distributed across the gene (low
$E_c^*$ for all the fragments) than to have a small absolute crossover disruption (low total $E_c$)
that is very localized (large $E_c^*$ for one fragment).

Calculating Compact Units of Structure

The current view of protein folding is that the process is hierarchical. First, a very
fast "burst" phase occurs where the unfolded polypeptide rapidly collapses into highly
compact units, such as alpha-helices. Next, the substructures condense into the tertiary
arrangement of the native structure (FIG. 18). The experimental observation that folding
is hierarchical has led to the "building block" theory that proteins have subunits that fold
and then assist higher-level rearrangements. Tsai, C-J., et al., Anatomy of protein
structures: visualizing how a one-dimensional protein folds into a three-dimensional
shape, Proc. Natl. Acad. Sci. USA, 97: 12038-12043 (2000). According to the invention,
crossovers that do not disrupt these building blocks will be more likely to lead to
functional chimeras.

A useful tool to visualize local units of condensed structure ("building blocks")
is the contact map. Rossmann, M. G., & Liljas, A., J. Molec. Biol., 85: 177-181. The
contact map is constructed by measuring the distance between all alpha-carbons in the
three-dimensional structure (Equation 1) and then generating a two-dimensional matrix
where residues that are within a cutoff distance $d_{cross}$ are marked as white whereas residues
that lie outside this cutoff distance are marked as black. Domains that occur on the level
of the one-dimensional polypeptide chain can be identified as triangles that can be drawn
on the diagonal that do not contain any black regions (FIG. 19). Effectively, this
identifies fragments of the structure that fold into a sphere of diameter $d_{cross}$.

Several algorithms have been proposed to divide the contact map into regions,
thus identifying domains in the structure. See, e.g., De Souza, et al., Intron positions
correlate with module boundaries in ancient proteins, Proc. Natl. Acad. Sci. USA, 93:
14632-14636 (1996); Gilbert, et al., Origin of Genes, Proc. Natl. Acad. Sci. USA, 94:
7698-7703 (1997); Go, M., Correlation of DNA exonic regions with protein structural
units in haemoglobin, Nature, 291: 90-92 (1981); Go, M., Modular structural units,
exons, and function in chicken lysozyme, Proc. Natl. Acad. Sci. USA, 80: 1964-1968
(1983).
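The contact map and the "triangle on the diagonal" test described above can be sketched as follows. This is a minimal illustration that assumes the alpha-carbon coordinates are already available as (x, y, z) tuples; the function names, the 21 angstrom default and the toy chain are assumptions made for the example.

```python
import math


def contact_map(ca_coords, d_cross=21.0):
    """True ("white") where two alpha-carbons lie within d_cross, else False ("black")."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) <= d_cross for j in range(n)]
            for i in range(n)]


def is_compact(cmap, start, end):
    """True if the triangle for residues start..end (inclusive) contains no black cell."""
    return all(cmap[m][n] for m in range(start, end + 1) for n in range(m, end + 1))


# Toy chain (assumed): ten alpha-carbons spaced 3.8 angstroms apart on a line.
chain = [(3.8 * i, 0.0, 0.0) for i in range(10)]
cmap = contact_map(chain)
print(is_compact(cmap, 0, 5))  # -> True  (end-to-end distance 19.0)
print(is_compact(cmap, 0, 6))  # -> False (end-to-end distance 22.8)
```

A fully extended toy chain is the least compact case; in a folded protein, much longer fragments can satisfy the same test.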

Go originally proposed that lines should be drawn that cross through the largest
white regions with the intent to separate the black regions. This fragments the structure
into domains in a way that minimizes the interaction between the domains. While this
algorithm was crudely successful in demonstrating the correlation between exons and
subdomains, it often fails on complicated structures that do not have an obvious domain
structure. Measuring the number of interactions at each site can quantitate this algorithm,

- - - 

where $A_{ij}$ = 0 if residues i and j are closer than $d_{cross}$ and $A_{ij}$ = 1 if residues i and j are
farther than $d_{cross}$. According to Go, residues that minimize $R_i$ are more likely to be
regions between domains.

In this example a plot of $R_i$ for transformylase was generated (FIG. 20(B)). This
algorithm predicts that there are three domain-forming regions in the protein structure
(three valleys), whereas two were sampled in the in vitro recombination experiment
(FIG. 20A). This indicates that, while crossovers in this region could form a domain, too
many coupling interactions are disrupted between the fragments, thus leading to
destabilized structures. Further, a calculated contact map (FIG. 21) and a plot of $R_i$
(FIG. 22) for beta-lactamase show that, while some crossovers occurred in regions that
are predicted to separate domains, this algorithm was relatively weak for predicting
crossover locations. Other domain-separating algorithms based on analyzing the contact
map have been proposed, but are not reliably consistent when analyzing the locations of
crossovers in recombination experiments (De Souza et al., 1996; Gilbert et al., 1997).

SCHEMA: Schema-based Hybrid Protein Optimization 

The present method identifies domains ("building blocks") in proteins based on
analyzing the contact map to optimize recombinants based on schema (FIG. 23). This
algorithm is based on searching the protein structure for regions that are compact, based
on comparing the length of a fragment with the size of the sphere into which the fragment
folds. Gilbert and co-workers found that, for a domain diameter of $d_{cross}$ = 21 angstroms,
the average fragment that can fold into this sphere is 15 residues long with a standard
deviation of 5 residues (De Souza et al., 1996). In other words, if a fragment of 20
residues folds such that all the residues are within a sphere of 21 angstroms, then this
fragment can be considered as being highly compact. Further, if a fragment of 15
residues folds into a sphere of 21 angstroms, then the compactness of this unit is
statistically average. This observation is utilized here by choosing a minimum fragment
length $n_{min}$ such that, if a fragment of this size or greater is folded into a sphere of diameter
$d_{cross}$, then this fragment is considered to be compact. Schema theory predicts that these
compact units ("building blocks") should not be disrupted by crossovers.

To determine the regions that are compact, the entire protein structure is scanned
with fragments of size $n_{min}$ and greater (FIG. 23). Each fragment is checked for whether
it can fold into a sphere of diameter $d_{cross}$ by inspecting the contact map for any regions of black
(residues that are separated by more than $d_{cross}$ angstroms) in the triangle that defines the
fragment. If there is no black in the triangle, then a compact unit is defined and
crossovers are disfavored along the fragment because this would disrupt a structural
building block. To mark this, a schema disruption profile is defined where higher
values indicate a more disruptive event. The profile is defined by

$$S_i = \sum_{j}^{N_T} \sum_{k}^{N_T} \delta\left( \sum_{m=j}^{k} \sum_{n=j}^{k} CM_{mn} \right) \qquad (9)$$

where the outer sums run over the fragments (from residue j to residue k) of size $n_{min}$ and
greater that contain residue i, $CM_{mn}$ is the element of the contact matrix corresponding to
residues m and n, and
$\delta(f)$ is a function that is equal to 1 if f = 0 and equal to 0 if f > 0. Effectively, Equation
(9) counts the number of times that residue i is involved in a compact unit. A residue that
has a large $S_i$ value is involved in a more compact unit than a residue that has a low $S_i$
value.
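The counting performed by the schema disruption profile can be sketched as below. This is a minimal illustration that assumes the profile counts, for each residue, the compact fragments of at least `n_min` residues that contain it, with the contact map given as a boolean matrix (True for "white"). The function name and toy map are hypothetical, and the coupling-based term introduced later in the text is not included.

```python
def schema_profile(cmap, n_min):
    """S[i] = number of compact fragments of length >= n_min that contain residue i.

    cmap[m][n] is True when residues m and n are within the cutoff ("white").
    A fragment j..k is compact when its triangle contains no black cell.
    """
    n = len(cmap)
    S = [0] * n
    for j in range(n):
        for k in range(j + n_min - 1, n):
            compact = all(cmap[m][q] for m in range(j, k + 1) for q in range(m, k + 1))
            if not compact:
                break  # every longer fragment starting at j contains the same black cell
            for i in range(j, k + 1):
                S[i] += 1
    return S


# Toy contact map (assumed): residues are in contact when at most 2 positions apart.
toy_map = [[abs(i - j) <= 2 for j in range(6)] for i in range(6)]
print(schema_profile(toy_map, n_min=3))  # -> [1, 2, 3, 3, 2, 1]
```

The brute-force triangle check keeps the sketch short; a practical implementation would likely reuse work between overlapping fragments.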

Making crossovers in building blocks that interact with the rest of the structure is more
disruptive than making crossovers in building blocks that are isolated from the remainder
of the structure. Following this idea, the algorithm combines the crossover disruption
measure (based on the disturbance of coupling interactions) with the domain-based
disruption measure to identify the compact units that are nucleating in folding (FIG. 23).
To do this, we add a term to Equation 9 such that fragments that fold into a compact unit,
but are not interacting with the remainder of the structure, are not counted in the schema
disruption profile. The modified equation is

' k k 



15 s ( = ZI 5\TLCM m g(E' e (10) 



where the function g(x, y) is equal to 1 if x > y and 0 otherwise. The schema disruption
profile generated by Equation (10) identifies the regions of the protein that are involved
in a compact unit that significantly contributes to the stability of the protein (many
coupling interactions). If a crossover occurs in these regions, then it is more likely to
have a destabilizing effect on the structure.

In Vitro Recombination Results: Beta-lactamase, Transformylase, P450 

The results of the SCHEMA calculation on the transformylase and beta-lactamase
data sets using the schema-based algorithm are shown in FIGS. 24 and 25, respectively.
The algorithm rapidly locates the regions in which crossovers are disruptive. The
advantages of the schema calculation over the alignment-based algorithm are threefold.
First, the calculation is deterministic and does not rely on sampling or the method of
computational hybridization that is used to reconstruct chimeric genes in silico. Second,
the SCHEMA calculation only requires the structure file and does not rely on the
accuracy of an alignment algorithm. Finally, the minima in the schema disruption profile
are the optimal cut points, whereas the maxima in the stochastic algorithm are the
statistically most likely cut points.

The algorithm predictions were compared with an in vitro evolution experiment
that recombined low-sequence identity (25%) P450scc and P450c27 genes. Pikuleva, et
al., Studies of distant members of the P450 superfamily (P450scc and P450c27) by random
chimeragenesis, Archives of Biochem. and Biophys., 334: 183-192 (1996). In this
experiment, several chimeras were generated that folded into the native structure. While
the structures of scc and c27 are unknown, the structure of a mammalian P450 (2C5) was
recently solved. The schema disruption profile for the 2C5 structure was calculated
(FIG. 26A) and was compared to the crossovers that resulted in folded chimeric
sequences. The equivalent locations for the crossovers were determined by running a
BLAST alignment of the local region around the crossovers as reported by Waterman and
co-workers. Pikuleva, Bjorkhem & Waterman, M. R., Archives of Biochem. and
Biophys., 334: 183-192 (1996). These crossovers are in regions that are predicted to be
the least disruptive. In another experiment, a bacterial P450cam and human P450 2C9
were recombined at a single cut point. Shimoji, M., et al., Design of a novel P450: a
functional bacterial-human cytochrome P450 chimera, Biochemistry, 37: 8848-8852
(1998). The chimera that resulted from this rationally-designed cut point folded
successfully. The crossover occurred at a location that is minimally disruptive in the
P450cam structure and near a minimum in the 2C5 structure (FIG. 26B). Together, these
recombination experiments lend further support to the disruption calculations. See also,
Hennecke, et al., Random circular permutation of DsbA reveals segments that are
essential for protein folding and stability, J. Mol. Biol., 286: 1197-1215 (1999);
Panchenko, et al., Foldons, protein structural modules, and exons, Proc. Natl. Acad. Sci.
USA, 93: 2008-2013 (1996).






WO 03/055978 



PCT/US02/34374 



The optimal parents for experimental methods that restrict the fragmentation
(such as DNA shuffling, restriction enzyme approaches, exon shuffling) can be
determined by analyzing the schema disruption profile. The parents, exons, or restriction
enzymes can be chosen such that the cut points occur at locations in the gene that
minimize the schema disruption.

FIG. 27A shows the total number of possible crossover locations for each parent
based on a minimum of six nucleotide overlap between parents. The differences in the
total number of crossovers correlate with the sequence identity shared between parents.
For example, parent 1 shares the most sequence identity with parents 2, 3, and 4, and
parent 4 shares the least sequence identity with parents 1, 2, and 3. FIG. 27B shows the
number of crossover points that are consistent with generating a low schema disruption
(< 30, values from FIG. 25D). Even though the total number of crossover points is
greater for parent 3, parent 4 has more potential crossover locations that are consistent
with a low schema disruption. This provides an explanation and possible
mechanism for the experimentally-observed absence of parent 3 in the improved
chimeras previously reported by Crameri et al., 1998. Thus, calculations and
comparisons of this kind can be used to predict optimal sets of parents for crossover
recombination. In this calculation example, parent 3 (Yersinia enterocolitica) would not
be used, because it contributes a relatively high crossover disruption in the schema
disruption profile, in favor of the other parents, which exhibit less crossover disruption.

6.2 Crossover Recombination of β-Lactamase-Like Genes by DNA Shuffling

This example describes experiments wherein the methods of the invention were
used to evaluate a crossover probability distribution for a family shuffling experiment
wherein four different β-lactamase-like genes (also referred to as cephalosporinase genes)
were recombined. (See, Crameri et al., Nature, 391: 288 (1998).)

The three-dimensional structure for the backbone and side chain of the
cephalosporinase protein expressed by Enterobacter cloacae was retrieved from that
protein's high resolution crystal structure. Lobkovsky et al., Proc. Natl. Acad. Sci.
U.S.A., 90: 11257-11261 (1993). Additional sequence information for the protein was
retrieved from the TrEMBL database (Bairoch & Apweiler, Nucl. Acids Res., 28: 45-48
(2000)) (Accession No. P05364). Sequences for homologous proteins expressed by other
organisms were also retrieved from the SWISS-PROT database (Bairoch & Apweiler,
supra), including sequences for cephalosporinase proteins expressed by Citrobacter
freundii (Accession No. P05193), Klebsiella pneumoniae (Accession No. P048437) and
Yersinia enterocolitica (Accession No. P45460).

Alignment of parental sequences. 

FIG. 3 is a gene alignment, using GAP, for four β-lactamase-like genes: (1)
Enterobacter cloacae, (2) Citrobacter freundii, (3) Yersinia enterocolitica and (4)
Klebsiella pneumoniae. SWISS-PROT or TrEMBL accession numbers for the protein
sequences and GenBank accession numbers for the DNA sequences are given. DNA
sequences were retrieved from the GenBank database (Accession Nos. X03966, X07274,
X63149 and X77455, respectively). These nucleotide sequences were also aligned, using
the polypeptide sequence alignment shown in FIG. 3 to align codons of the DNA
sequences that encoded aligned amino acid residues.

Generation of crossover mutants. 

A library of possible recombinant mutants was generated in silico from the
protein alignments using all possible "crossover locations" or "cut points" determined for
the nucleic acid and protein alignments. Specifically, regions of four sequential amino
acids in a first aligned sequence that were identical at the same positions in another
aligned sequence were identified as candidate crossover regions for the affected
parents.
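The identification of candidate crossover regions from a pairwise alignment can be sketched as follows. This is a minimal illustration, assuming two pre-aligned protein sequences of equal length with "-" as the gap character; the function name and the toy sequences are hypothetical.

```python
def candidate_crossovers(seq_a, seq_b, window=4):
    """Start indices of windows in which two aligned sequences are identical.

    window=4 mirrors the four sequential amino acids used in this example.
    Windows containing an alignment gap ("-") are skipped.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        piece_a = seq_a[i:i + window]
        piece_b = seq_b[i:i + window]
        if piece_a == piece_b and "-" not in piece_a:
            hits.append(i)
    return hits


# Toy aligned sequences (assumed), differing at positions 1 and 7.
print(candidate_crossovers("MKTAYIAKQR", "MRTAYIAEQR"))  # -> [2, 3]
```

For more than two parents, the same scan could be run over every pair, recording which parents are eligible at each location.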

In this example, the parameter of four amino acids relates to a minimum required
DNA identity shared between parents for DNA hybridization to occur. On the DNA
level, six nucleotides of shared identity are required for hybridization to occur. The
practical reason that the DNA limit (6) is lower than the amino acid limit (4x3=12
nucleotides) is that multiple codons can encode a single amino acid. This requires
that a higher threshold be used when calculating the possible crossover points based on
an amino acid alignment. Another approach would be to calculate the thermodynamic
energy of hybridization based on the specific base pairs on each parent. See Moore et al.,
Predicting crossover generation in DNA shuffling, PNAS 98(6): 3226-3231 (2001). Also,
the melting temperature for the denaturation of the DNA overlap can be calculated based on
the G-C and A-T content. In this example, alignments are used to determine where
sequences can reanneal. Alignments are not necessary for the calculations of Examples
6.1 and 6.3.
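As an illustration of estimating a melting temperature from G-C and A-T content, the sketch below uses the Wallace rule, a common rule of thumb for short oligonucleotides (2 degrees C per A or T, 4 degrees C per G or C). The text does not say which formula was intended, so this choice and the function name are assumptions.

```python
def wallace_tm(overlap):
    """Approximate melting temperature (degrees C) of a short DNA overlap."""
    s = overlap.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


# A six-nucleotide overlap, the minimum shared identity mentioned in the text.
print(wallace_tm("ATGCGC"))  # -> 20
```

The rule is only a rough guide for short duplexes; nearest-neighbor thermodynamic models are generally more accurate.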

Two exemplary in silico methods were used to generate candidate hybrids or
crossover mutants, based on the set of possible cut points determined from the alignment
algorithm. In both methods, parental fragments are cut at the crossover locations which
satisfy a predetermined crossover probability and are randomly recombined at those
crossover points to produce a pool of recombined or hybrid proteins.
Method 1 (Random Probability Model of Fragment Extension). 
To generate a candidate crossover mutant, a parent sequence was selected at
random from the four cephalosporinase sequences. This sequence was written to the
candidate mutant sequence up to a possible cut point. Upon reaching the possible cut
point, a random number between 0 and 1 was chosen, and if the number was below a
predetermined crossover probability $P_c$, then a second parent was randomly chosen.
(Note that because each parent template is randomly selected for extension at a crossover
point, the second parent could in some cases be the same as the first parent.) The mutant
sequence was then extended from the cut point using the sequence of the second parent
as a template, up until a next cut point was reached. Then, a random number between 0
and 1 was again chosen. If this number was below the predetermined crossover probability
$P_c$, then the mutant sequence was extended from the cut point using another randomly
selected parent as a template, up to the next cut point. In each case that the random
number was not below the predetermined crossover probability $P_c$, the mutant sequence
was extended to the next cut point by continuing with the same parent, i.e., without
crossing over to another parent sequence. The probability $P_c$ can be the same or
different for each cut point. These steps were repeated until the sequence was complete,
e.g., a full-length hybrid protein was generated, comprising fragments of different parents
recombined at selected cut points.

This process was repeated many times, each time with a randomly selected parent,
until between about 10^4 and 10^6 full-length cephalosporinase crossover mutants were
generated.
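The fragment-extension procedure of Method 1 can be sketched as follows. This is a simplified illustration: it assumes equal-length aligned parents and a single list of cut points usable by every parent, whereas in the text eligibility depends on which parents share identity at each location. The function name, parameters and toy inputs are hypothetical.

```python
import random


def make_chimera(parents, cut_points, p_c=0.30, min_fragment=8, rng=random):
    """Build one chimera by random fragment extension.

    parents: equal-length aligned sequences.
    cut_points: candidate crossover positions, each between 1 and length - 1.
    Returns the chimera and, for each position, the index of the parent used.
    """
    length = len(parents[0])
    current = rng.randrange(len(parents))  # first template chosen at random
    origin = []
    start = 0
    last_crossover = 0
    for cut in sorted(cut_points) + [length]:
        origin.extend([current] * (cut - start))  # extend with the current template
        start = cut
        long_enough = cut - last_crossover >= min_fragment
        if cut < length and long_enough and rng.random() < p_c:
            current = rng.randrange(len(parents))  # may re-select the same parent
            last_crossover = cut
    chimera = "".join(parents[p][i] for i, p in enumerate(origin))
    return chimera, origin


# Toy parents (assumed): four 40-residue sequences of a single letter each.
toy_parents = ["A" * 40, "B" * 40, "C" * 40, "D" * 40]
chimera, origin = make_chimera(toy_parents, [5, 10, 20, 30], rng=random.Random(1))
```

Calling the function repeatedly builds the in silico pool; passing a seeded `random.Random` instance makes a run reproducible.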

The crossover probability using Method 1 was based roughly on fragment size
and in this example was set to P_c = 0.30. In addition, a further restriction was
imposed whereby each polypeptide fragment must be at least eight amino acid residues in
length before another crossover was allowed to occur. The minimum fragment size of
eight amino acids reflects a lower experimental bound relevant to the Stemmer protocol
when the beta-lactamase genes were shuffled. In the DNA shuffling protocol, very small
DNA fragments get "lost" in the reaction mixture and cannot become part of a
recombinant mutant. Thus, this parameter is only relevant for Stemmer-like shuffling
experiments and is not important for other methods (e.g., StEP has no minimum fragment
size). This rule is not connected with disruption theory. Using these parameters, the
average number of fragments per recombinant mutant was 13.4, corresponding to an
average of 80-100 nucleotides per fragment. This was set to model results that were
previously reported in actual directed evolution experiments. See, FIG. 1B, FIG. 12 and
Crameri et al., Nature, 391:288 (1998).
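
The fragment-extension procedure of Method 1 can be sketched in a few lines of Python.
This is only an illustrative sketch, not the program actually used: the function name
method1_hybrid, its arguments, and the representation of parents as aligned strings of
equal length are assumptions made for the example. A re-draw that happens to pick the
same parent is still recorded as a crossover, following the note in the text.

```python
import random

def method1_hybrid(parents, cut_points, p_c=0.30, min_len=8, rng=random):
    """Build one hybrid by fragment extension (sketch of Method 1).

    parents    : aligned amino acid sequences of equal length (assumed)
    cut_points : residue indices (1 .. length-1) where a crossover may occur
    Returns (hybrid_sequence, crossover_positions).
    """
    length = len(parents[0])
    current = rng.randrange(len(parents))      # first parent chosen at random
    pieces, crossovers = [], []
    frag_start = pos = 0
    for cut in sorted(cut_points):
        pieces.append(parents[current][pos:cut])   # extend to the cut point
        pos = cut
        # enforce the minimum fragment length on both sides of the cut
        long_enough = (cut - frag_start >= min_len) and (length - cut >= min_len)
        if long_enough and rng.random() < p_c:
            current = rng.randrange(len(parents))  # may re-pick the same parent
            crossovers.append(cut)
            frag_start = cut
    pieces.append(parents[current][pos:])      # finish with the current parent
    return "".join(pieces), crossovers
```

Calling the function repeatedly (for example 10^4 to 10^6 times) yields an in silico
pool of the kind described above.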

Method 2 (Random Probability Model To Generate and Anneal Parent 
Fragments) 

An alternative method, Method 2, was also used to generate candidate crossover
mutants by DNA shuffling. This method is represented diagrammatically by FIG. 13.
As shown by the arrows in FIG. 13A, parental strands are fragmented by randomly
distributing cut points with probability P_c. In the figure, the arrows mark cut points and
the thatched lines represent regions of sequence similarity between parents. In FIG. 13B,
a parent is chosen at random to determine the first parental fragment. The next fragment
is chosen amongst the parents that share adequate sequence identity (including the parent
of the previous fragment) with equal probability. If the cut point at the end of the parent
fragment corresponds to an identified crossover location based upon sequence identity,
as described above, the next fragment is chosen from the pool of eligible parents,
including the parent of the previous fragment. This process is repeated until an entire
offspring is created. The complete library of recombinant mutants that can be generated
by the cut pattern is shown in FIG. 13C. When this method was utilized to generate
crossover mutants, the crossover probability in this example, based on fragment size,
was set at P_c = 0.15. As in Method 1, a further restriction was imposed so that each
fragment must be at least eight amino acids in length. See also, FIG. 1A.

In this experiment, the number of fragments per mutant was 7, and the average 
fragment size was 80-100 nucleotides per fragment. As in Method 1, this is 
approximately the same number that has been previously reported. See, Crameri et al, 
Nature, 391:288 (1998). 

The distribution of crossover locations in the resulting library of crossover
mutants is shown in FIGS. 4A and 4B using Method 1, and FIGS. 4C and 4D using
Method 2. The graphs provided in these figures indicate the probability, P_c, that a mutant
randomly selected from the library has a crossover point at a given amino acid residue
(horizontal axis). The solid bars beneath the horizontal axis in these graphs indicate
residues where crossover points occurred in actual, functional mutants previously
identified by Crameri et al. (1998).

In this exemplary model, the crossover probability P_c is related to fragment size
and is the same at every residue. A residue is picked at random and a random number is
chosen between 0 and 1. If this number is less than P_c, then a crossover is "marked" on
the sequence. This is repeated N times, where N is the number of residues. This
effectively fragments the parent sequences for modeling purposes. The cut points that
do not correspond to regions of adequate sequence identity are thrown out (FIG. 13) and
the remaining crossovers are used to create all the possible recombinant mutants. A
probability distribution for crossovers along the gene can be calculated by taking all the
recombinant mutants generated by the algorithm and keeping track of where the
crossovers occur. The number of times a crossover is observed is normalized by the total
number of chimeric mutants generated to obtain a probability in silico.

Calculating coupling interactions.
Using the high resolution crystal structure obtained for the cephalosporinase protein
expressed by Enterobacter cloacae (Lobkovsky, supra), coupling interactions were
identified for all pairwise combinations of amino acid side chains. Specifically, the
coupling interactions were delineated for each pair of amino acid residues, i and j, as a
stability parameter e(i, j), with stability corresponding to the calculated energies of
hydrogen bonds, electrostatic interactions and van der Waals interactions between side
chains. To calculate such pairwise interactions, the DREIDING force field (Mayo et al.,
J. Phys. Chem., 94:8897 (1990)) was used in a modified form that included a hydrogen
bonding parameter as previously described (Dahiyat et al., Science, 6:1333 (1997)).
Pairs of residues exhibiting pair-wise interaction energies whose absolute value is
greater than 0.25 kcal/mol were considered coupled for these examples.
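
As a minimal sketch of this cutoff, assuming the pairwise energies e(i, j) have already
been computed by a force field and stored in a dictionary keyed by residue pair (the
function name coupled_pairs and the data layout are hypothetical, chosen only for
illustration):

```python
def coupled_pairs(energies, cutoff=0.25):
    """Return the residue pairs (i, j), i < j, that count as coupled.

    energies : dict mapping (i, j) -> interaction energy e(i, j) in kcal/mol
    A pair is coupled when |e(i, j)| is greater than `cutoff`.
    """
    return {(min(i, j), max(i, j))
            for (i, j), e in energies.items()
            if i != j and abs(e) > cutoff}
```

For example, coupled_pairs({(1, 2): -0.8, (2, 5): 0.1}) keeps only the pair (1, 2).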

Pair-wise interactions were also calculated using ORBIT protein design software
that included parameters for hydrogen bonds, electrostatic interactions and van der Waals
interactions between side chains.

Determining crossover disruption of mutants.

Having thus identified pair-wise interactions between residues of the parent
Enterobacter cloacae cephalosporinase proteins, each of the recombination mutants
generated in silico was examined to identify coupling interactions that were originally
present in the parent sequence(s), but had been disrupted in the crossover mutant. This
demonstrates that Methods 1 and 2 generate random pools of mutants in silico that model
the random pools generated in actual shuffling experiments. According to the invention,
rules based on coupling interactions are used, as described, to eliminate many of the
randomly generated candidates from consideration; i.e. the model focuses on optimum
candidates which are more likely to exhibit desirable properties.

The average crossover disruption for the pool of crossover mutants shown in FIG.
4A was E_c = 407. The average crossover disruption for the pool of crossover mutants
shown in FIG. 4C was E_c = 44.

Screening mutant libraries. 

The in silico crossover mutants were separated into two logical "bins". The first
bin included all crossover mutants generated. The second bin contained those mutants
that exhibited a lower level of crossover disruption. In particular, the crossover mutants
in the second bin had a crossover disruption level that fell below a preselected threshold.
The subpools of FIGS. 4B and 4D represent those chimeras that have crossover
disruptions below the respective thresholds. The average crossover disruption is
compared with a threshold. For instance, the threshold of 75 (FIG. 4B) was applied to
the larger pool where the average disruption was 407 (FIG. 4A). The disruption
threshold of 18 (FIG. 4D) was taken from the larger pool where the average disruption
was 44 (FIG. 4C). In the first case, the smaller pool represents 1% of the larger pool and
in the second case, the smaller pool represents 7.5% of the larger pool. The differences
in the average disruption of the two large pools (407 versus 44) reflect several differences
in the algorithm that was used to generate them. The difference is primarily due to the
fact that amino acids that are identical in all of the parents were scored as disruptive when
generating the crossover disruption of the chimeric mutants in FIGS. 4A and 4B (P_i = P_j
= 1.0 in Equation 2).
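
The two-bin screen can be illustrated with a short Python sketch. It assumes that a
crossover disruption value has already been computed for every in silico mutant; the
function name screen_pool and the dictionary layout are hypothetical.

```python
def screen_pool(disruptions, threshold):
    """Select the low-disruption subpool (the "second bin").

    disruptions : dict mapping mutant id -> crossover disruption
    threshold   : preselected disruption threshold (e.g. 75 or 18)
    Returns (ids below the threshold, fraction of the full pool they represent).
    """
    kept = [m for m, e in disruptions.items() if e < threshold]
    fraction = len(kept) / len(disruptions) if disruptions else 0.0
    return kept, fraction
```

The returned fraction corresponds to the reported share of the larger pool that
survives the screen (1% and 7.5% in the two cases above).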

The distribution of crossover locations that produces minimum amounts of
crossover disruption is shown in FIGS. 4B and 4D. The graphs show the calculated
probability P_c that a crossover event occurred at a nucleotide corresponding to each
amino acid residue of the protein. The crossover probability is determined by counting
the number of times a cut point occurs at a certain residue in a pool. The pool is
generated either by sequence identity alone or by combining sequence identity with
disruption. The number of times a cut point is observed is divided by the total number
of chimeras in the pool. For example, if P_c (or P_(cross)) of residue 25 is 0.02, this
means that 2% of all chimeric mutants had a crossover at residue 25. While the
unscreened pool of mutants has an even distribution of crossover points in areas of
sequence identity between the parental strands (FIGS. 4A and 4C), the screened pool has
an uneven distribution of crossover points that are concentrated at the sequence termini.
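
The counting procedure just described can be sketched as follows. The sketch assumes
each chimera in a pool is represented simply by the list of residues at which it has a
crossover; the function name crossover_probability is illustrative only.

```python
from collections import Counter

def crossover_probability(pool):
    """Estimate the crossover probability at each residue.

    pool : list of crossover-position lists, one list per chimera
    Returns a dict residue -> fraction of chimeras with a crossover there.
    """
    counts = Counter()
    for crossovers in pool:
        counts.update(set(crossovers))   # count each chimera once per residue
    n = len(pool)
    return {res: c / n for res, c in counts.items()}
```

With this representation, a value of 0.02 at residue 25 means that 2% of the chimeras in
the pool have a crossover at residue 25.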

Comparison with Previous Experiments. 

FIGS. 4A and 4C show the probability distribution for cut points in beta-lactamase
calculated using DNA shuffling methods based upon sequence similarity. The variance
in the maximum probability at each crossover is caused by the number of parents that
share sequence identity at that point. The grey bars beneath the horizontal axis indicate
actual crossovers that were observed in prior experiments (Crameri et al., Nature,
391:288 (1998)). The width of these bars is due to the inability to resolve the cut point
to a single residue, due to the sequence similarity between parents. FIGS. 4B and 4D
show the probability distribution when the additional constraint of low crossover
disruption is imposed.

As can be readily determined from FIGS. 4A and 4C, the unscreened pools of
mutant hybrids show even distribution of crossover points in areas of sequence identity
between the parental strands. However, as can be determined from FIGS. 4B and 4D,
once the total pool is screened for mutants with minimal crossover disruption, the areas
of parental sequence homology do not exhibit an even distribution of crossover points.
For the proteins in these examples, computationally determined favorable crossover
locations are found mainly at the termini of the sequences. These findings paralleled
empirical observations (e.g. Crameri, supra). The few crossover locations not located at
the termini of the sequence correlated well with data from experiments by Stemmer. See,
Stemmer, Proc. Natl. Acad. Sci. U.S.A., 91:10747 (1994); and Stemmer, Nature,
370:389 (1994).

The crossover probability distribution for the family shuffling experiment, in
silico, starting with Citrobacter freundii, Klebsiella pneumoniae, Enterobacter cloacae,
and Yersinia enterocolitica also corresponded well to previous in vivo experimental data
where Yersinia enterocolitica was not observed in a pool of mutant offspring. Crameri,
supra.

For example, FIG. 6 shows the crossover probability distribution for the family
shuffling experiment with and without Yersinia enterocolitica as a starting parent. The
dashed gray line represents the probability distribution for combinations of Citrobacter
freundii, Klebsiella pneumoniae, and Enterobacter cloacae. The solid black line is for the
mutants containing the Yersinia enterocolitica sequence. The inclusion of Yersinia
enterocolitica in the set of starting parents leads to the creation of a pool of recombinant
mutant offspring with an increased probability of greater disruption of coupling
interactions than the pool of recombinant mutant offspring without Yersinia
enterocolitica. The generation of crossover disruption profiles, also called schema
profiles, provides a mechanism by which the optimal parents can be determined.

Determination of Optimal Parents Based on an In Silico Chimera Library

Given the explosive growth of the gene databases due to the exhaustive
sequencing of large numbers of organisms, sequences of homologous genes are easily
accessible. Currently, the choice of starting parents for family shuffling is arbitrary or is
made with minimal information (e.g., availability, sequence similarity). To date, there
is no rigorous method to quantitatively use the information in the sequence databases to
identify optimal starting parents. For example, in the beta-lactamase experiment
discussed above, Stemmer and co-workers shuffled four genes, but only three of these
genes were found in the improved recombinants (Crameri et al., 1998). The Yersinia
enterocolitica gene (parent 3 - FIG. 27) was not observed in the top mutants.

To understand this effect, the methods of the invention were applied to calculate
all of the possible recombination locations between parents (1) Enterobacter cloacae, (2)
Citrobacter freundii, (3) Yersinia enterocolitica, and (4) Klebsiella pneumoniae. FIG.
27. A potential crossover was recorded for every region shared between two parents that
had six nucleotides in common. For instance, the total number of potential crossovers
for parent 1 is the sum of the number of potential crossovers for parents 1-2, 1-3, and 1-4.
The differences in the total number of crossovers for each parent reflect the sequence
identity shared between parents. Because parent 3 shares more sequence identity with
parents 1 and 2 than parent 4 does with parents 1 and 2, its total number of potential
crossovers is greater. However, when the additional constraint of having a low schema
disruption is imposed, parent 4 has more potential crossovers than parent 3. This
provides an explanation for why parent 4 was observed in the improved chimeras, but not
parent 3.
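
One way to tally potential crossovers between aligned parents is sketched below. The
sketch makes an interpretive assumption that is not stated in the text: a "region ... that
had six nucleotides in common" is taken to be a maximal run of at least six identical
aligned nucleotides. The function names and the use of "-" for alignment gaps are
likewise hypothetical.

```python
def count_potential_crossovers(seq_a, seq_b, min_identity=6):
    """Count maximal runs of >= min_identity identical aligned nucleotides."""
    count, run = 0, 0
    for a, b in zip(seq_a, seq_b):
        if a == b and a != "-":        # identical, non-gap position
            run += 1
        else:
            if run >= min_identity:
                count += 1
            run = 0
    if run >= min_identity:            # close a run that reaches the end
        count += 1
    return count

def total_for_parent(index, seqs, min_identity=6):
    """Sum the potential crossovers of one parent with every other parent."""
    return sum(count_potential_crossovers(seqs[index], s, min_identity)
               for k, s in enumerate(seqs) if k != index)
```

The total for parent 1, for instance, is the sum over the pairs 1-2, 1-3 and 1-4, as in
the text.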
6.3 Single Cut Point Recombination

The invention can also be applied to sets of parent biopolymers that do not share
sequence identity, using recombination methods that do not rely on sequence identity.
At least two methods are known in the art for producing recombinant gene libraries
having cross-overs at any position, regardless of sequence identity. See, e.g., Ostermeier
et al., Nature Biotechnology, 17:1205-1209 (1999); Ostermeier et al., Bioorg. Med.
Chem., 7:2139-2144 (1999); Sieber et al., Nature Biotechnology, 19, 456-460 (2000).
In particular, these methods allow genes (and their corresponding polypeptides) that have
diverged nucleotide sequences to be recombined. However, in the experimental
implementation described here only two parent sequences are recombined with only a
single cut point (crossover).

A method of the invention was used to simulate the recombination of PurN and
GART glycinamide ribonucleotide transformylase (Ostermeier et al., Nature
Biotechnology, 17:1205-1209 (1999)). A coupling matrix was calculated using the
three-dimensional structure of PurN previously described by Almassy et al. (Proc. Natl.
Acad. Sci. U.S.A., 89:6114 (1992)). The crossover disruption was then calculated for
each possible single crossover mutant. FIG. 5 provides a plot showing the crossover
disruption calculated for each mutant, indicated by the amino acid residue of the
crossover location.
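
For a single cut point, the disruption at each crossover location can be sketched as a
count of coupled pairs that end up on opposite sides of the cut. This simplified sketch
treats every coupled pair as equally disruptive and ignores any weighting for residues
that are identical in both parents; the function name and the 0-based indexing are
assumptions.

```python
def single_crossover_profile(contacts, n_residues):
    """Count coupled pairs split by each possible single crossover.

    contacts   : iterable of coupled residue pairs (i, j), 0-based
    n_residues : length of the aligned protein
    Returns a list d where d[x] is the number of pairs broken when residues
    [0, x) come from one parent and residues [x, n_residues) from the other.
    """
    pairs = [(min(i, j), max(i, j)) for i, j in contacts]
    return [sum(1 for lo, hi in pairs if lo < x <= hi)
            for x in range(n_residues + 1)]
```

Because few pairs span a cut placed near either end of the sequence, such a profile
falls toward zero at the termini.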

The range of amino acid sequences shown in FIG. 5 (i.e., amino acid residues
50-150 of the aligned glycinamide ribonucleotide transformylase proteins) corresponds to
crossover regions where non-homologous recombinations were previously constructed
by Benkovic et al., Nature Biotechnology, 17:1205-1209 (1999). Crossover locations for
functional crossover mutants that were identified in these previous experiments are
indicated on the graph in FIG. 5 by horizontal lines. The vertical lines show the positions
where single crossovers occurred and led to functional enzymes. The "2" indicates that
this crossover was sampled twice in the library. The diamonds show where homologous
recombination (DNA shuffling with single cut points) experiments produced crossovers.
The calculated crossover disruption decreases rapidly outside of the 50-150 amino acid
sequence region, indicating that, as expected, crossovers would be strongly biased
towards the N- and C-termini of the parents. Local crossover disruption minima are also
present in the region between amino acid residues 50-150 shown in FIG. 5. These
minima reflect the fact that glycinamide ribonucleotide transformylase proteins comprise
at least two topologically separate domains. Thus, the local minima in crossover
disruption reflect crossover points which occur at the intersection of such separate
domains.

Correlation Between Schema Disruption And Enzyme Activity 

This example demonstrates a correlation between the number of disrupted
contacts ("schema disruption") and enzymatic activity.

Hybrid beta-lactamase proteins were made by recombining the TEM-1 and PSE-4
genes. The TEM-1 gene was obtained from the pSTBlue-1 vector offered by Stratagene.
Its structure can be retrieved from the Protein Databank (code: 1BTL). The PSE-4 gene
was obtained from the PMON vector obtained from Richard Levesque. The PMON and
PSE-4 structures are publicly available, e.g. from the Protein Databank (code:
[INSERT]). These two genes share 40% amino acid identity, e.g. by BLAST analysis.
Recombination was at crossover locations that were predicted by the computational
algorithm of Example 6.1. The schema were calculated based on the three-dimensional
structure and an alignment of the two sequences. Once the schema were identified, the
number of interactions between the schema was calculated. These calculations are
described in Example 6.5.

The oligonucleotide fragments corresponding to the peptide schema were made
by PCR amplification, where the primers at either end contain a short piece of DNA that
overlaps with the preceding gene fragment (FIG. 28). This overlap ensures that the
fragments will re-anneal. In this experiment, the number of fragments required to make
a hybrid gene is one more than the number of crossovers. The PCR protocol is to initially
heat at 95 °C, then perform 25 cycles of heating at 94 °C for 45 seconds, cooling at 52 °C
for 45 seconds, and extending at 72 °C for 1 min. After the 25 cycles, the sample is held
at 4 °C. There are many variations of this PCR protocol which would be suitable for
amplifying the fragments. Each PCR mixture is then run on a 1% LE agarose gel and the
band corresponding to the fragment is cut. The fragment is then isolated either through
ethanol precipitation (for fragments less than 100 bp) or using a Zymoclean-5 gel
extraction kit (for fragments > 100 bp).

Once the oligonucleotide fragments are isolated, they are re-annealed to create a
complete gene fragment through a second PCR amplification step. The forward and
reverse primers have the sequences for the restriction sites of EcoRI and HindIII,
respectively, so that the complete genes can be inserted into a vector. The times and
temperatures are identical to the previous amplification round. A pre-PCR step can be
used to improve the purity of the amplified genes. This PCR protocol is 25 iterations of
95 °C for 30 seconds, 5 °C for 30 seconds, and 72 °C for 2 min. A final extension of 10
min at 72 °C is done after the cycles are complete. The fragments are purified using the
Zymoclean-5 gel extraction kit. Finally, the fragments are ligated into the PMON vector
(Sanschagrin F, Theriault E, Sabbagh Y, Voyer N, Levesque RC (2000) J. Antimicrob.
Chemo. 45: 517-519), which has been modified to contain the EcoRI and HindIII
restriction sites. The PMON vector has kanamycin resistance. The vectors containing
the hybrid genes are transformed into XL1-BLUE super competent (>10^9) cells and
grown on plates that contain 10 µg/ml kanamycin. Colonies are isolated and the vector
is extracted and sequenced. Some of the recombinant genes contained point mutations
after the construction process (<1/gene). These mutations were removed using the
standard Quickchange protocol (Stratagene). Once the sequence of the hybrid gene is
confirmed, the vector is retransformed into XL1-BLUE competent (10^6) cells and the
activity of the beta-lactamase is determined.

Each hybrid beta-lactamase was tested for its activity towards the degradation of
the antibiotic ampicillin. To rapidly screen for this property, agar plates are made with
the following exponentially increasing concentrations of ampicillin: 10, 20, 40, 80, 160,
320, 640, and 1280 µg/ml. Aliquots of transformed cells are spread on the plates and
allowed to grow for 24 hours. More active hybrids will grow on plates with greater
concentrations of ampicillin. The activity is measured as the minimum inhibitory
concentration (MIC), in other words, the lowest concentration of ampicillin that kills the
cells. For example, if the cells grow at a concentration of 40 µg/ml but not 80 µg/ml, the
MIC would be recorded as 80. The XL1-BLUE cells naturally have a MIC of 10.
Beta-lactamase activity cannot be measured below this point. The wild-type TEM-1 and
PSE-4 enzymes have MICs of 2560.
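
The MIC read-out from a plate series can be expressed as a small helper. This is an
illustrative sketch only; the function name and the dictionary of growth observations
are assumptions.

```python
def minimum_inhibitory_concentration(growth):
    """Return the lowest tested ampicillin concentration with no growth.

    growth : dict mapping concentration (µg/ml) -> True if colonies grew
    Returns None when the cells grew at every tested concentration.
    """
    for conc in sorted(growth):
        if not growth[conc]:
            return conc
    return None
```

For cells that grow at 10, 20 and 40 µg/ml but not at 80 µg/ml and above, the function
returns 80, matching the example in the text.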

Results 

Six hybrids were constructed with increasing disruption, as determined from the
schema calculation (FIG. 29). Characteristics of the hybrids are shown in the Table
below.

Hybrid    Number of Residues    Number of Mutations    E

1         17                    7                      26
2         21                    5                      34
3         28                    18                     36
4         146                   83                     58
5         60                    41                     68
6         37                    23                     100



The enzymatic activity of the hybrid proteins decreases dramatically beyond a threshold
disruption. These experiments demonstrate quantitatively the effect of disruption on the
properties of the hybrids. This provides useful information in the design of
recombination experiments. Libraries of hybrid proteins can be created that are within
the determined threshold. This will maximize the number of hybrids in the library that
are folded and demonstrate activity. Previously, it was difficult to determine the amount
of disruption that would result in non-functional mutants. The invention provides
methods for determining and applying this kind of threshold.

Quantifying the disruption threshold is also useful in predicting the fraction of
functional hybrids that exist in a library created by random methods. For example, one
version of the computational algorithm calculates a disruption profile for all
single-crossover recombinants. However, without an understanding of the threshold, this
profile may not be applied to make useful predictions. Knowledge about threshold behavior
clarifies the number of crossovers that are predicted to be acceptable. In the case of
recombining TEM-1 and PSE-4, it has been shown that very few single crossovers will
be acceptable (FIG. 30).

6.5 Correlation Between Schema Disruption And Enzyme Activity

According to the invention, the schema theory of genetic algorithms has been
applied to develop a computational algorithm to identify protein subunits that can be
recombined without disturbing the integrity of the three-dimensional structure. When
structural integrity is disturbed, recombination is less likely to produce folded or
functional hybrid proteins. Selecting protein subunits for recombination according to
algorithms that preserve three-dimensional structure is more likely to produce properly
folded or functional hybrid proteins than when structural interactions are disrupted.
Crossovers found by screening libraries of randomly shuffled proteins for functional
hybrids strongly correlate with those predicted by this approach.

In this example, by recombining computationally predicted subunits of two
beta-lactamase proteins sharing 40% amino acid identity, a threshold in the amount of
disruption a resulting hybrid protein can tolerate has been experimentally determined.
These results demonstrate that the correlation between introns and structural domains
could arise by cycles of recombination and natural selection. From an engineering
perspective, this approach can be used to optimize in vitro evolution by determining
optimal crossover locations and starting parents computationally.

Experimental Design 

Ever since the first protein structures were elucidated, researchers have proposed
ways to divide their otherwise complicated topologies into well-defined substructures,
or domains (Rossman, M. G. and Liljas, A. (1974) J. Mol. Biol. 85, 177-181; Crippen,
G. M. (1978) J. Mol. Biol. 126, 315-332; Rose, G. D. (1979) J. Mol. Biol. 134, 447-470;
Go, 1981; Zehfus, M. H. & Rose, G. D. (1986) Biochemistry 25, 5759-5765; Tsai, C-J.,
Maizel, J. V., & Nussinov, R., (2000), Proc. Natl. Acad. Sci. USA, 97: 12038-12043).
Domains are well understood and have been identified according to various criteria, with
differing experimental results. One approach has been to define a domain as a subunit
that folds independently (Holm, L. & Sander, C. (1994) Proteins 19, 256-268;
Panchenko, A. R., Luthey-Schulten, Z. & Wolynes, P. G. (1996) Proc. Natl. Acad. Sci.
USA 93, 2008-2013). This is a very strict approach, as the majority of protein fragments
will not fold outside of their structural context. In other studies, the locations of certain
types of introns were shown to occur at geometrical domain boundaries, suggesting that
larger proteins are composed of smaller domains discovered earlier in evolution and
pieced together by gene duplication and recombination, a scenario known as the
"introns-early" theory (Go, M., (1981), Nature, 291: 90-92; Go, M., (1983) Proc. Natl.
Acad. Sci. USA, 80: 1964-1968; De Souza, S. J., Long, M., Schoenbach, L., Roy, S. W.,
& Gilbert, W., (1996) Proc. Natl. Acad. Sci. USA, 93: 14632-14636; Gilbert, W., De
Souza, S. J., & Long, M., (1997) Proc. Natl. Acad. Sci. USA, 94: 7698-7703). However,
the role of introns in evolution is not well understood. Further, a domain boundary may
lack an intron because that domain was not sampled or the intron disappeared over time.
According to the invention, a connection is made between genetic algorithms and the
biological role of a domain. The schema algorithms of the invention provide subunits of
protein structure, or domains, that can be successfully recombined.

Any of these techniques can be used to work with domains according to the
invention; however, the algorithms specifically described herein, and in this example, are
preferred. More specifically, using recombination to observe that a shuffling event can
occur, rather than inferring it from the existence of introns, provides a direct approach to
understanding the subunits which can be interchanged to create functional proteins. For
example, it has been suggested that the optimal recombination points allow swapping of
structural domains (Ranganathan, A., et al., (1999) Chem. Biol. 6, 731-741; Bogarad, L.
D. & Deem, M. W. (1999) Proc. Natl. Acad. Sci. USA 96, 2591-2595; Riechmann, L. and
Winter, G. (2000) Proc. Natl. Acad. Sci. USA 97, 10068-10073; Lutz, S. & Benkovic, S.
J. (2000) Curr. Opin. Biotech. 11, 319-324). The difficulty has been to predict what
these smaller building blocks look like. According to the invention, a computational
algorithm is provided to divide a protein into structural elements or domains that can be
swapped by recombination.

Optimal crossover locations correspond to those that result in the combination of
clusters of bits (schema or "building-blocks") that interact favorably (Holland, J. (1975)
Adaptation in Natural and Artificial Systems (The University of Michigan Press, Ann
Arbor, MI); Forrest, S. & Mitchell, M. (1993) in Foundations of Genetic Algorithms 2,
ed. Whitley, L. D. (Morgan Kaufmann, San Mateo), pp. 109-126; Mitchell, M. (1996) An
Introduction to Genetic Algorithms (The MIT Press)). It is undesirable that recombination
divide a schema such that an offspring inherits fractions of schema from different
parents. To identify the equivalent schema in proteins, a computational algorithm is
provided (SCHEMA), which predicts elements of protein structure that must be inherited
from the same parent. These schema will be the building blocks from which novel
proteins can be assembled by recombination.

The SCHEMA algorithm works by calculating the interactions between residues
and then determining the number of interactions that are disrupted in the creation of a
hybrid protein. A disruption occurs when an interaction is broken due to different amino
acids being inherited from each parent (FIG. 31A). FIG. 31A illustrates a schema
disruption. Black lines in the structure represent peptide bonds and the small dots are
interactions between amino acid side chains. Two hybrid proteins are shown. When the
last four residues come from one parent and the remaining residues come from the other
parent, three interactions are disrupted. When the last eight residues come from the same
parent, then there is no disruption. According to the schema approach of the invention,
achieving folded hybrid proteins is more likely when the fewest interactions are
disrupted. FIG. 31B shows the schema disruption profile of the structure in FIG. 31A,
calculated using Equation 11 with a window size w = 6.

In this example, two residues are considered interacting if any of their atoms
(excluding hydrogen atoms) are within a cutoff distance d_c = 4.0-5.0 angstroms,
corresponding to approximately 5-8 interactions per residue. The schema disruption E_α
of a fragment α can then be calculated using the equation:

\[
E_\alpha = \sum_{i \in \alpha} \sum_{j \notin \alpha} c_{ij} P_{ij} \qquad (11)
\]

where N_T is the total number of residues and N_α is the number of residues in fragment α,
so that the first sum runs over the N_α residues of the fragment and the second over the
remaining N_T - N_α residues. An element c_ij of the contact matrix is equal to one if
residues i and j are within distance d_c; otherwise it is zero.

Interacting residues for which the amino acid identity is the same in all parents
cannot contribute to the crossover disruption. When parents are recombined that have
a high sequence identity, the presence of clusters of interacting amino acids decreases and
very few crossovers are disruptive. The probability P_ij in Equation (11) scales the
interaction to account for the possibility that the hybrid mutant will have a combination
of amino acids at residues i and j present in at least one of the parents. The probability
P_ij is determined by examining a sequence alignment of the parents and counting the
possible number of unique amino acid combinations in the hybrid proteins, divided by the
total number of combinations.
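
A direct evaluation of the fragment disruption can be sketched as follows. The sketch
assumes the contact matrix and the probability matrix have already been built and are
supplied as square nested lists; the function name schema_disruption and the 0-based
residue indices are illustrative choices, and no particular way of computing the
probabilities from the alignment is implied.

```python
def schema_disruption(contact, prob, fragment):
    """Sum c_ij * P_ij over residue i inside and residue j outside a fragment.

    contact  : N x N matrix, 1 if residues i and j are within the cutoff
    prob     : N x N matrix of scaling probabilities P_ij
    fragment : iterable of residue indices belonging to the fragment
    """
    inside = set(fragment)
    n = len(contact)
    return sum(contact[i][j] * prob[i][j]
               for i in inside
               for j in range(n) if j not in inside)
```

A contact between residues that are identical in all parents would be given a weight of
zero in prob, so that it does not contribute to the total.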

Equation (11) counts the number of interactions that are broken by the
substitution of a fragment. Ideally, an algorithm would search all possible crossover
combinations and determine the associated disruption for each. This is possible for
experiments where only a single crossover is allowed per hybrid (Ostermeier, M., Shim,
J. H. & Benkovic, S. J. (1999) Nature Biotechnology 17, 1205-1209; Sieber, V.,
Martinez, C. A. and Arnold, F. H. (2001) Nature Biotechnology, 19, 456-460).
Analyzing multiple crossovers by this method leads to combinatorial difficulties, both in
the calculation and the visualization of the data. Algorithms of the invention overcome
this limitation. A window of residues is defined and the number of internal interactions
is counted. The number of internal interactions represents the minimum amount of
disruption realized by a single crossover in the window. In choosing the window size,
the assumption is made that the probability of two or more crossovers occurring in the
window is very small. The window is then slid along the protein structure and a profile
is generated where the schema disruption of each residue in the window is incremented
by the amount of disruption created by a crossover in that region. Mathematically, the
schema disruption S_i at residue i is expressed as:




$$S_i = \sum_{j=i-w} \sum_{k=j} s_{jk} \qquad (12)$$

where w is the window size and s_jk is the number of internal interactions in the window
that starts at residue j and ends at residue k. Based on Equation (12), a schema disruption
profile is computed where a large S_i value indicates that a residue is involved in a more
compact unit (FIG. 31B). When crossovers occur in the minima of the schema
disruption profile, the structure is fragmented such that the maximum number of internal
interactions is preserved, thus minimizing the disruption.
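
The sliding-window procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact form of Equation (12): the window has a fixed size w, internal interactions are counted from a precomputed coupling matrix c (which could already be weighted by P_ij), and the helper names are hypothetical.

```python
def schema_disruption_profile(c, w):
    """Slide a window of w residues along the chain. For each window, count
    the interactions between residues inside it and add that count to the
    profile value of every residue in the window."""
    n = len(c)
    profile = [0] * n
    for start in range(n - w + 1):
        end = start + w
        internal = sum(c[k][l]
                       for k in range(start, end)
                       for l in range(k + 1, end))
        for i in range(start, end):
            profile[i] += internal
    return profile


def profile_minima(profile):
    """Interior positions whose value does not exceed either neighbour and is
    strictly below at least one of them (candidate crossover locations)."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] <= profile[i - 1] and profile[i] <= profile[i + 1]
            and (profile[i] < profile[i - 1] or profile[i] < profile[i + 1])]


# Toy coupling matrix (hypothetical): residues 0-3 form one compact unit,
# residues 4-7 form another, and there are no contacts between the two units.
n = 8
c = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(n)]
     for i in range(n)]
profile = schema_disruption_profile(c, w=3)
print(profile, profile_minima(profile))
```

In this toy case the minima fall at the boundary between the two compact units, which is where a crossover would preserve the most internal interactions.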

Representative SCHEMA Calculations 

The SCHEMA algorithm was tested in five experiments where the genetic
information from several low sequence identity parents was shuffled randomly to create
large libraries of hybrid proteins. In each of the experiments, a screen or selection was
employed to identify the recombinant mutants that retained function or showed
improvement. The location of the crossovers identified in the screened libraries was
compared with the schema disruption profile.

FIG. 32 shows a comparison of SCHEMA calculations, in a condensed view,
with crossovers that resulted in functional hybrids of the following (from top to bottom):
(a) cephalosporinases (Crameri, A., Raillard, S.-A., Bermudez, E. & Stemmer, W.
P. C. (1998) Nature 391, 288-291; Sanschagrin, F., Theriault, E., Sabbagh, Y., Voyer, N.
& Levesque, R. C. (2000) J. Antimicrob. Chemo. 45, 517-519);
(b) subtilisins (Ness, J. E., Welch, M., Giver, L., Bueno, M., Cherry, J. R.,
Borchert, T. V., Stemmer, W. P. C. & Minshull, J. (1999) Nature Biotechnology 17,
893-896);
(c) cytochrome P450s (Brock, B. J. & Waterman, M. R. (2000) Arch. Biochem.
Biophys. 373, 401-408); and
(d) transformylases (Ostermeier, M., Shim, J. H. & Benkovic, S. J. (1999) Nature
Biotechnology 17, 1205-1209; Lutz, S., Ostermeier, M. & Benkovic, S. J. (2001) Nucl.
Acids Res. 29, e16).

The black regions indicate schema and the white regions mark minima in the
schema disruption profile. Crossovers that occurred in libraries screened for functional
or improved hybrid proteins are indicated by the arrows. Surprisingly, nearly all of the
minima of the schema disruption profile are sampled by these experiments, and there are
very few outliers. The profiles were determined from the following PDB structures:
1BLS (Lobkovsky, E., Moews, P. C., Liu, H. S., Zhao, H. C., Frere, J. M. & Knox, J. R.
(1993) Proc. Natl. Acad. Sci. USA 90, 11257-11261), 1SVN (Betzel, C., Klupsch, S.,
Papendorf, G., Hastrup, S., Branner, S. & Wilson, K. S. (1992) J. Mol. Biol. 223, 447),
1DT6 (Williams, P. A., Cosme, J., Sridhar, V., Johnson, E. F. & McRee, D. E. (2000)
Mol. Cell, 93: 121), and 1CDE (Almassy, R. J., Janson, C. A., Kan, C. C. & Hostomska, Z.
(1992) Proc. Natl. Acad. Sci. USA 89, 6114-6118). The calculations were run with a
window size of 15 residues and d_c = 4.0-5.0 angstroms.

Nearly all of the crossovers occurred in regions that are predicted to be minima
in the schema disruption profiles.

In a preferred embodiment, the window size (e.g. Equation 12) that best predicts
the locations of crossovers in selected libraries is fifteen, which results in domain sizes
of approximately twenty to thirty residues. This observation is consistent with the
domain sizes that correlate with the size of shuffleable exons (De Souza, S. J., Long, M.,
Schoenbach, L., Roy, S. W. & Gilbert, W. (1996) Proc. Natl. Acad. Sci. USA 93,
14632-14636). Typically, there are three types of substructures that are predicted to be
retained in protein evolution: (1) bundles of alpha-helices, (2) an alpha-helix combined
with a beta-strand, and (3) beta-strands connected by a hairpin turn. While the algorithm
often finds these structures, there are numerous interesting exceptions. For example,
crossovers are often predicted in the center of alpha-helices. Further, while loops can be
good places for crossovers to occur, they can also be highly disruptive if they divide
interacting units of secondary structure. In addition, there are schema that are composed
of complicated loop-like topologies and no secondary structure.

Prediction of Crossover Locations 

To test our ability to predict crossover locations, we designed experiments to
recombine fragments of two beta-lactamases, TEM-1 and PSE-4, using the SOEing
procedure to piece together fragments by PCR (Horton, R. M. (1995) Mol. Biotech. 3,
93-99). While the proteins in this example have only 40% amino acid sequence identity,
they share similar structures (TEM-1, SEQ ID NO: 5; and PSE-4, SEQ ID NO: 6) (Jelsch,
C., Mourey, L., Masson, J. M. & Samama, J. P. (1993) Proteins 16, 364; Lim, D.,
Sanschagrin, F., Passmore, L., De Castro, L., Levesque, R. C. & Strynadka, N. C. J.
(2001) Biochemistry 40, 395).

First, the schema disruption profile of beta-lactamase TEM-1 was calculated to
identify the schema (FIG. 33). Then, the number of interactions between the schema was
calculated, according to Equation 11. Based on these calculations, recombination
experiments were designed to test varying levels of disruption in the hybrid proteins, as
shown in the following Table.

Designed TEM-1/PSE-4 Hybrid Beta-lactamases

             Cut 1                   Cut 2
Hybrid(a)    #    Context(b)         #    Context(b)          m(c)   E_α    MIC
1A           163  loop, surface      179  strand, core        7      26     2056(d)
2A           189  helix, core        216  loop, surface, as   18     36     1028
2B                                                                          40
3A           70   loop, core, as     216  loop, surface, as   83     58
3B                                                                          20
4A           70   loop, core, as     130  loop, core, as      41     68     10(e)
4B                                                                          10(e)
5A           254  loop, surface                               23     100    10(e)

a. The letter in the name indicates the parent that composes the first portion of the gene, where
A is PSE-4 and B is TEM-1. For the double cut mutants, an 'A' indicates a gene structure of
A-B-A and 'B' indicates B-A-B.

b. The context of the side chain of the residue where the cut occurs. The notation "as"
indicates that the crossover occurs near the active site.

c. The number of unique mutations that occur when the smaller fragment of one parent is
inserted into the larger context of the remaining parent.

d. Wild-type activity of both PSE-4 and TEM-1.

e. The MIC of XL1-BLUE cells. No beta-lactamase activity is observed.

The interactions between schema for beta-lactamase are calculated according to Equation
11. In this example, schema can be based on the interaction strength between the various
subunits. For example, subunits can be grouped according to strong interactions, e.g. E_α
> 19, medium interactions, e.g. 10 < E_α < 19, and weaker interactions, e.g. E_α < 9.

To test the sequence dependence of the schema disruption, the sequence mirror
of each hybrid was constructed. For instance, for a two-crossover hybrid (three
fragments), the hybrid was constructed where the first fragment is from PSE-4
(labeled 'A') as well as the hybrid where the first fragment is from TEM-1 (labeled 'B').
Thus, both "A-B-A" and "B-A-B" hybrids are constructed. In addition, a hybrid
previously determined to have wild-type activity (Hybrid 1A) was made as a
positive control (Sanschagrin et al., 2000), and a hybrid where the crossover occurs in a
maximum of the schema disruption profile was made as a negative control (Hybrid 5A).
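
Constructing a hybrid and its sequence mirror from two aligned parents can be illustrated with a short sketch. It assumes the parents are pre-aligned strings of equal length and that each cut is given as the 0-based index of the first residue taken from the other parent. The function names and toy sequences are hypothetical, and the residue numbering used in the Table is not reproduced here.

```python
def build_hybrid(parent_a, parent_b, cuts, first="A"):
    """Assemble a hybrid by copying from one parent and switching to the
    other parent at each cut position."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    source = {"A": parent_a, "B": parent_b}
    current, start, pieces = first, 0, []
    for cut in sorted(cuts) + [len(parent_a)]:
        pieces.append(source[current][start:cut])
        current = "B" if current == "A" else "A"
        start = cut
    return "".join(pieces)


def mirror_pair(parent_a, parent_b, cuts):
    """Return the hybrid that starts with parent A and its sequence mirror,
    which starts with parent B (A-B-A and B-A-B for two cuts)."""
    return (build_hybrid(parent_a, parent_b, cuts, first="A"),
            build_hybrid(parent_a, parent_b, cuts, first="B"))


# Toy parents (hypothetical): letters mark which parent each residue came from.
pse4_like = "AAAAAAAA"
tem1_like = "BBBBBBBB"
print(mirror_pair(pse4_like, tem1_like, cuts=[3, 6]))
```

With two cuts the pair corresponds to the A-B-A and B-A-B gene structures; with a single cut it gives the A-B and B-A hybrids.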

Each hybrid was tested for protein activity by measuring the minimum
concentration of ampicillin required to inhibit cell growth (MIC). A higher MIC value
indicates that the hybrid is more active. Wild-type TEM-1 and PSE-4 are highly active
towards ampicillin (MIC 2048 mg/ml) and have similar activities towards various
beta-lactam substrates (Sanschagrin et al., 2000). Testing the MIC of each hybrid protein, we
found a sharp transition in the amount of disruption that is tolerated (FIG. 34). When the
disruption increases beyond this threshold, the hybrid proteins showed no activity. This
trend does not correlate with the number of mutations that effectively occur when the
hybrid is constructed (see Table above).

The five hybrids in the Table that show activity (1A, 2A, 2B, 3A, 3B) have
interesting characteristics. All have at least one crossover at a buried position.
Additionally, a crossover occurs in the middle of a helix for two hybrids (3A, 3B) and at
the end of a beta-strand in hybrid 1A. Finally, four of the hybrids (2A, 2B, 3A, 3B) have
crossovers near the active site. Notably, the negative control (5A) has a crossover that
occurs in a loop on the surface and only a few residues are recombined near the terminus.
Naively, the crossovers of hybrid 5A could be predicted to be non-disruptive, yet the
algorithm of the invention correctly identified them to be disruptive.

These hybrids demonstrate the wide range of possible start and end points for
schema and their context dependence. Go originally discovered a correlation between the
location of introns and structural domains, a correlation that has held for a wide range of
proteins (Go, M. (1981) Nature 291: 90-92; Go, M. (1983) Proc. Natl. Acad. Sci. USA
80: 1964-1968). This correlation has been interpreted as evidence for the "introns-early"
theory of evolution, where the first large proteins were constructed from smaller
shuffleable subunits through recombination and gene duplication (De Souza, 1996;
Gilbert et al., 1997). As a result, regions of non-coding DNA (introns) separated the
subunits in the genome. Over evolutionary time, the introns disappeared where they were
not necessary or were disadvantageous, for example, in the restricted genome sizes of
prokaryotes. Proponents of this theory have argued that if introns appeared late in
evolution, their locations would appear random with respect to structural domains
(Gilbert et al., 1997).

This example and its experiments demonstrate that the correlation between
introns and domains could occur as a result of natural selection, even if the introns
appeared late. Of the many proposed functions of introns, one is that they facilitate the
swapping of exons (citation). If the probability of a crossover is equal across the gene,
then a long region of non-coding DNA would bias the crossovers towards a specific
region of the fully constructed gene. Cycles of recombination and selection can bias the
location of introns if the fitness of an organism is reliant on the ability of an intron to
promote shuffling. If, in a population of these organisms, introns were randomly
distributed throughout the gene, then there would be a selective advantage towards those
individuals that had introns in regions that are the most likely to result in successful
shuffling events.

Directed evolution has been used to observe this process as it progresses. When
crossovers are randomly distributed throughout the gene, those that result in the
preservation of schema are the most likely to result in folded, functional hybrids.
Therefore, if introns were to appear to promote recombination, they would most likely
reside in low-disruption regions after selection.

Recombination is a powerful tool for optimization. It promotes the combination
of traits from multiple parents onto a single offspring, thus exploiting information
obtained in previous rounds of selection (Holland, 1975). A significant application of the
invention is to accelerate molecular optimization by laboratory evolution methods
(Stemmer, 1994; Crameri et al., 1998) through the use of computational tools (Voigt, C.
A., Mayo, S. L., Arnold, F. H. & Wang, Z-G. (2001) Proc. Natl. Acad. Sci. USA 98,
3778-3783; Voigt, C. A., Kauffman, S. & Wang, Z-G. (2001) Advances in Protein
Chemistry 55, 79-160). As opposed to randomly generating crossovers, combinatorial
libraries with targeted crossovers improve the probability of finding functional
improvements in a library. Practically, this
significantly reduces the number of mutants that must be screened. The elucidation and
experimental verification of evolutionary dynamics allows the design of a new generation
of evolutionary methods that maximize our a^^ 
for pharmaceutical and industrial applications. 
It will be appreciated by persons of ordinary skill in the art that the examples
herein are illustrative only, and do not limit the scope of the invention or the
accompanying claims.

WE CLAIM:

1. A method for selecting a crossover location in a first biopolymer having a first
polymer sequence, for recombination with one or more second biopolymers each having
its own second polymer sequence, which method comprises:

identifying coupling interactions between pairs of residues in the first polymer
sequence;

generating a plurality of data structures, each data structure representing a
crossover mutant comprising a recombination of the first and a second polymer sequence
wherein each recombination has a different crossover location;

determining, for each data structure, a crossover disruption related to the number
of coupling interactions disrupted in the crossover mutant represented by the data
structure; and

identifying, among the plurality of data structures, a particular data structure
having a crossover disruption below a threshold,

wherein the crossover location of the crossover mutant represented by the
particular data structure is the identified crossover location.

2. A method of claim 1, wherein the particular polymer sequence comprises a
sequence of amino acid residues.

3. A method of claim 1, wherein the particular polymer sequence comprises a
sequence of nucleotide residues.

4. A method of claim 1, wherein coupling interactions are identified by use of a
coupling matrix.

5. A method of claim 1, wherein the coupling matrix is the summation of all the
coupling interactions of the first polymer sequence.

6. A method of claim 1, wherein coupling interactions are identified by a
determination of a conformational energy between residues.

7. A method of claim 1, wherein coupling interactions are identified by a
determination of interatomic distances between residues.

8. A method of claim 6, wherein conformational energies for each of the first and
second polymer sequences are determined from a three-dimensional structure for at least
one of the first and second polymer sequences.

9. A method of claim 7, wherein interatomic distances for each of the first and
second polymer sequences are determined from a three-dimensional structure for at least
one of the first and second polymer sequences.

10. A method of claim 2, wherein coupling interactions are identified by a
conformational energy between residues above a threshold.

11. A method of claim 1, wherein a coupling interaction between a pair of residues
in the first polymer sequence is disrupted in a crossover mutant wherein a coupling
interaction between a pair of residues is disrupted in a crossover mutant if the identity of
both residues participating in the coupling interaction is different than that which exists
in any of the parents.

12. A method of claim 8, wherein a coupling interaction between a pair of residues
in the first polymer sequence is disrupted in a crossover mutant wherein a coupling
interaction between a pair of residues is disrupted in a crossover mutant if the identity of
both residues participating in the coupling interaction is different than that which exists
in any of the parents.

13. A method of claim 1, wherein the crossover disruption is the summation of all
coupled interactions in the parent that are considered disrupted in the data structure
representing the crossover mutant.

14. A method of claim 1, wherein the threshold is an average level of crossover
disruption for the plurality of data structures.

15. A method of claim 1, wherein the threshold is at least one standard deviation 
below the average level for the plurality of data structures. 

16. A method of claim 1, wherein the threshold is set so that approximately 7.5% of
the total number of generated data structures is below the threshold.

17. A method of claim 1, wherein the threshold is set so that approximately 1% of the
total number of generated data structures is below the threshold.

18. A method of claim 1, wherein the threshold is set so that approximately 0.001%
of the total number of generated data structures is below the threshold.

19. A method of claim 1, wherein the generation of crossover mutants comprises:
the sequence alignment of a plurality of biopolymers;

the identification of possible cut points in the biopolymer based upon regions of
sequence identity identified by the sequence alignment; and

the generation of single crossover mutants based upon the identified possible cut
points.

20. A method of claim 19, wherein the regions of sequence identity must contain at
least 4 residues.

21. A method of claim 19, wherein there must be at least eight residues between crossovers.



22. A method of claim 1, wherein the generation of the plurality of data structures
comprises:

the sequence alignment of a plurality of biopolymers using simulated annealing
with non-homologous parents;

selecting crossover locations based upon the minimization of crossover
disruption, fragment size, starting number of parents; and

the generation of a plurality of data structures based upon the identified possible
crossover locations.

23. A method of claim 1, wherein the generation of the plurality of data structures
comprises:

choosing one of the biopolymers from the plurality of biopolymers at random;

copying the biopolymer until a possible crossover location is reached;

choosing a random number between 0 and 1;

choosing a new biopolymer from the plurality of biopolymers to copy to the
offspring if the random number is below a crossover probability (P_c); and

repeating the above process until the data structure representing the crossover
mutant is the desired length.

24. A method of claim 19, wherein the generation of the plurality of data structures
based upon identified cut points comprises:

cutting the biopolymers into biopolymer fragments by randomly assigning cut
points with a set probability;

randomly choosing one of the biopolymer fragments as a starting parent;

randomly identifying another biopolymer fragment from the total pool of the
biopolymer fragments;

ligating the identified biopolymer fragment to the parent fragment, if the
identified fragment has a sequence identity cut-point at the end of the fragment; and

repeating the randomly identifying step until the data structure representing the
crossover mutant is the desired length.

25. A method for directed evolution of a polymer, which method comprises steps of:
providing a plurality of parent polymer sequences;

identifying crossover locations in the parent polymer sequences for recombination
according to claim 1;

generating one or more mutant polymer sequences utilizing recombinatory
techniques targeted at the identified crossover locations on the parent polymer sequences;

screening the one or more mutant sequences for the one or more properties of
interest; and

selecting at least one mutant sequence where one or more properties of interest
are identified.

26. A method according to claim 25, wherein the method is iteratively repeated, and 
wherein at least one mutant sequence selected in a first iteration is a parent sequence in 
a second iteration. 


27. A method of claim 25, wherein the recombination techniques are selected from 
the group consisting of: DNA shuffling, StEP method, fragmentation and reassembly, 
synthesis, and random-priming recombination. 

28. A computer system for analyzing a polymer sequence, which computer system 
comprises: 

memory and a processor interconnected with the memory and having one or more 
software components loaded therein, wherein the one or more software components cause 
the processor to execute steps of a method according to claim 1. 

29. A computer system of claim 28, wherein the software components comprise a 
database of polymer sequences. 

30. A computer system of claim 28, wherein the software components comprise a 
database of three-dimensional structures for polymer sequences.

31. A computer program comprising a computer readable medium having one or
more software components encoded in computer readable form, wherein the one or more 
software components may be loaded into a memory of a computer system and cause a 
processor interconnected with the memory to execute steps of a method according to 
claim 1. 

32. A computer program according to claim 30, wherein the computer readable 
medium further has, encoded thereon in computer readable form, a database of polymer 
sequences. 

33. A computer program according to claim 30, wherein the computer readable 
medium further has, encoded thereon in computer readable form, a database of three- 
dimensional structures for polymer sequences. 

34. A computer system for analyzing a polymer sequence, which computer system 
comprises: 

memory and a processor interconnected with the memory and having one or more 
software components loaded therein, wherein the one or more software components cause 
the processor to execute steps of a method according to claim 19. 

35. A computer program comprising a computer readable medium having one or 
more software components encoded in computer readable form, wherein the one or more 
software components may be loaded into a memory of a computer system and cause a 
processor interconnected with the memory to execute steps of a method according to 
claim 19. 

36. A computer system for analyzing a polymer sequence, which computer system 
comprises: 

memory and a processor interconnected with the memory and having one or more
software components loaded therein, wherein the one or more software components cause
the processor to execute steps of a method according to claim 23.

37. A computer program comprising a computer readable medium having one or 
more software components encoded in computer readable form, wherein the one or more 
software components may be loaded into a memory of a computer system and cause a 
processor interconnected with the memory to execute steps of a method according to 
claim 23. 

38. A computer system for analyzing a polymer sequence, which computer system 
comprises: 

memory and a processor interconnected with the memory and having one or more 
software components loaded therein, wherein the one or more software components cause 
the processor to execute steps of a method according to claim 24. 

39. A computer program comprising a computer readable medium having one or 
more software components encoded in computer readable form, wherein the one or more 
software components may be loaded into a memory of a computer system and cause a 
processor interconnected with the memory to execute steps of a method according to 
claim 24. 

40. A computer system for analyzing a polymer sequence, which computer system 
comprises: 

memory and a processor interconnected with the memory and having one or more 
software components loaded therein, wherein the one or more software components cause 
the processor to execute steps of a method according to claim 25. 

41. A method for producing hybrid polymers from two or more parent polymers 
comprising the steps of: 

identifying structural domains of at least one parent polymer; 
organizing identified domains into schema;
calculating a schema disruption profile;

selecting at least one crossover location based on the schema disruption profile;
and

recombining two or more parent polymers at one or more selected crossover
locations to produce at least one hybrid polymer.

42. A method of claim 41, wherein parent polymers are recombined in silico, in vitro,
in vivo, or in any combination thereof.

43. A method of claim 41, wherein parent polymers are recombined in silico to
produce at least one candidate hybrid polymer.

44. A method of claim 43, wherein parent polymers are physically recombined at one
or more crossover locations, including at least one selected crossover location, to produce
at least one hybrid polymer corresponding to a candidate hybrid polymer.

45. A method of claim 44, wherein parent polymers are physically recombined in
vitro.

46. A method of claim 44, wherein parent polymers are physically recombined in
vivo.

47. A method of claim 41, wherein each parent polymer comprises a polypeptide.

48. A method of claim 44, wherein each parent polymer comprises a polypeptide.

49. A method of claim 41, wherein each parent polymer comprises an
oligonucleotide.

50. A method of claim 44, wherein each parent polymer comprises an
oligonucleotide.

51. A method of claim 44, wherein the parent polymers are one of polypeptides and
oligonucleotides, and wherein parent polymers are recombined in a directed evolution
experiment.

52. A method of claim 44, comprising the step of screening hybrid polymers for one
or more properties.

53. A method of claim 51, comprising the step of screening hybrid polymers for one
or more properties.

54. A method of claim 51, wherein the directed evolution experiment includes at least
one protocol selected from the group consisting of fragmentation and reassembly, family
shuffling, exon shuffling, StEP, ITCHY, synthesis techniques, and PCR-based
techniques.

55. A method of claim 44, wherein hybrid polymers are expressed by host cells.

56. A method of claim 41, wherein hybrid polymers are expressed by host cells.

57. A method of claim 53, wherein hybrid polymers are expressed by host cells.

58. A method of claim 41, wherein crossover locations are selected from a schema
disruption profile based on a prediction that the selected crossovers will tend to produce
relatively less schema disruption than other crossover locations.

59. A method of claim 44, wherein crossover locations are selected from a schema
disruption profile based on a prediction that the selected crossovers will tend to produce
relatively less schema disruption than other crossover locations.

60. A method of claim 41, wherein crossover locations are selected based on a
schema disruption threshold.

61. A method of claim 44, wherein crossover locations are selected based on a 
schema disruption threshold. 

62. A method of claim 51, wherein crossover locations are selected based on a 
schema disruption threshold. 

63. A method of claim 41, wherein crossover locations are selected to preserve 
schema from at least one parent polymer. 

64. A method of claim 41, wherein crossover locations are selected to preserve 
schema from a plurality of parent polymers. 

65. A method of claim 44, wherein crossover locations are selected to preserve 
schema from at least one parent polymer. 

66. A method of claim 44, wherein crossover locations are selected to preserve 
schema from a plurality of parent polymers. 

67. A method of claim 51, wherein crossover locations are selected to preserve 
schema from at least one parent polymer. 

68. A method of claim 51, wherein crossover locations are selected to preserve 
schema from a plurality of parent polymers. 

69. A method of claim 44, wherein a library of candidate hybrid polymers is 
compared with a library of physically recombined hybrid polymers.

70. A method of claim 51, wherein the sequence space of a directed evolution
experiment is reduced based on a library of in silico candidate hybrid candidate
sequences.

71. A method for producing a library of hybrid polymers comprising the steps of:
choosing two or more parent polymers;

identifying structural domains of at least one parent polymer; 
organizing identified domains into schema; 
calculating a schema disruption profile; 

selecting crossover locations based on the schema disruption profile; 

recombining two or more parent polymers at one or more selected crossover 
locations to produce a set of hybrid polymers; 

repeating at least the choosing and recombining steps to produce at least one 
additional set of hybrid polymers; and 

generating a library of hybrid polymers from the sets of hybrid polymers. 

72. A method of claim 71, wherein the repeated steps comprise choosing at least one
hybrid polymer as a parent polymer.

73. A method of claim 71, wherein recombining steps are performed in silico.

74. A method of claim 73, further comprising physically recombining parent 
polymers at selected crossover locations to produce hybrids in the library. 

75. A method of claim 71, wherein schema are common to at least two parents. 

76. A method of claim 74, wherein schema are common to at least two parents. 

77. A method of claim 71, wherein a schema disruption profile is calculated based 
on one or both of conformational energy and interatomic distances. 
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78. A method of claim 75, wherein a schema disruption profile is calculated based 
on one or both of conformational energy and interatomic distances. 

79. A method of claim 73, wherein parent polymers are physically recombined in a 
5 directed evolution experiment. 

80. A method of claim 79, wherein the directed evolution experiment includes at least 
one protocol selected from the group consisting of fragmentation and reassembly, family 
shuffling, exon shuffling, StEP, ITCHY, synthesis techniques, and PCR-based 

10 techniques. 

81. A method of claim 74, further comprising screening hybrids in the library for one 
or more properties. 

15 82. A method of claim 73, further comprising physically recombining parent 

polymers at selected crossover locations to produce hybrids in the library and screening 
hybrids in the library for one or more properties; and wherein the repeated steps comprise . 
choosing at least one hybrid polymer as a parent polymer based on screening results. 

20 83. A method of claim 41, wherein schema comprise domains identified according to" 

sequence alignments between two or more parent polymers. 

84. A method of claim 71, wherein schema comprise domains identified according to 
sequence alignments between two or more parent polymers. 

25 

85. A method of claim 41, wherein the crossover location comprises a crossover region. 

86. A method of claim 71, wherein the crossover location comprises a crossover region. 

87. A method of claim 41, wherein the schema disruption profile comprises fitness
contributions of polymer residues of one or more parent polymers.

88. A method of claim 71, wherein the schema disruption profile comprises fitness 
contributions of polymer residues of one or more parent polymers. 

89. A method of claim 41, further comprising the step of calculating a crossover 
disruption profile. 

90. A method of claim 71, further comprising the step of calculating a crossover
disruption profile.

91. A method of claim 41, further comprising restricting the selection of crossover 
locations based on at least one predetermined constraint. 

92. A method of claim 91, wherein the predetermined constraint is based on a protocol
for physically recombining the polymers.

93. A method of claim 92, wherein the predetermined constraint comprises at least one
of a requirement of sequence identity between parents, a constraint on the number of
crossovers, and a constraint on the location of crossovers.

94. A method of claim 41, further comprising the steps of generating a coupling matrix 
and using the matrix in at least one of the identifying, organizing, calculating, and 
selecting steps. 

95. A method of claim 71, further comprising the steps of generating a coupling matrix 
and using the matrix in at least one of the identifying, organizing, calculating, and 
selecting steps. 

96. A method of claim 41, wherein domains are identified based on sequence
information for at least one parent polymer.

97. A method of claim 71, wherein domains are identified based on sequence 
information for at least one parent polymer. 

98. A method of claim 41, wherein domains are identified based on a crystal structure 
for at least one parent polymer. 

99. A method of claim 71, wherein domains are identified based on a crystal structure 
for at least one parent polymer. 

100. A method of claim 71, wherein crossover locations are selected from a schema 
disruption profile based on a threshold disruption value. 
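
By way of illustration only, threshold-based selection from a disruption profile might be sketched as follows. The sketch assumes the profile is available as one numeric value per candidate crossover position and that lower values indicate less disruption; the function name `select_crossovers` and the example values are hypothetical and do not come from the specification.

```python
from typing import List, Sequence


def select_crossovers(profile: Sequence[float], threshold: float) -> List[int]:
    """Return 0-based positions whose disruption is strictly below the threshold.

    Assumption: `profile[i]` is the disruption predicted for a crossover at
    position i, and smaller values are less disruptive.
    """
    return [pos for pos, value in enumerate(profile) if value < threshold]


if __name__ == "__main__":
    # Hypothetical profile values, for illustration only.
    example_profile = [4.0, 1.5, 0.5, 3.2, 0.9]
    print(select_crossovers(example_profile, threshold=1.0))
```

Whether positions exactly at the threshold are kept is a design choice; the sketch excludes them.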

101. A method for modeling the recombination of two or more parent polymers 
comprising the steps of: 

obtaining structural information for at least one parent polymer; 
evaluating coupling interactions between polymer residues based on the structural
information; 

identifying domains based on the determined coupling interactions; 
calculating the crossover disruption of the identified domains to produce a 
disruption profile; 

applying a predetermined threshold disruption to each domain of the disruption
profile;

at least one of, accepting domains which satisfy the threshold and rejecting 
domains which do not satisfy the threshold; 

repeating at least the identifying, calculating and applying steps until each 
identified domain is accepted or rejected; 

designating the accepted or rejected domains as disruptive; 

selecting crossover regions from domains that are not designated as disruptive;
and

recombining parent polymers at selected crossover regions. 

102. A method of claim 101, wherein the step of identifying domains comprises 
determining the polymer residues which belong to each domain, and the step of selecting 
crossover regions comprises specifying one or more residues within at least one non- 
disruptive domain. 

103. A method of claim 101, wherein the threshold disruption represents a maximum 
allowable disruption, domains having a disruption above the threshold are accepted as 
disruptive and are preserved, domains having a disruption below the threshold are 
rejected as non-disruptive and may be altered, and crossover regions are selected from
residues belonging to non-disruptive domains. 

104. A method of claim 103, wherein domains having a disruption equal to the threshold
are one of accepted as disruptive or rejected as non-disruptive. 
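
The threshold test recited in claims 103 and 104 might be sketched as follows. This is a minimal sketch under stated assumptions: each domain is represented by a name and a single disruption value, and the handling of values equal to the threshold is exposed as a flag because claim 104 permits either outcome. The function name `classify_domains` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def classify_domains(
    disruption: Dict[str, float],
    threshold: float,
    equal_is_disruptive: bool = True,
) -> Tuple[List[str], List[str]]:
    """Split domains into (disruptive, non_disruptive) lists.

    Domains above the threshold are treated as disruptive (to be preserved);
    domains below it are treated as non-disruptive (crossovers allowed).
    Domains exactly at the threshold follow `equal_is_disruptive`.
    """
    disruptive: List[str] = []
    non_disruptive: List[str] = []
    for name, value in disruption.items():
        if value > threshold or (value == threshold and equal_is_disruptive):
            disruptive.append(name)
        else:
            non_disruptive.append(name)
    return disruptive, non_disruptive


if __name__ == "__main__":
    # Hypothetical domain names and disruption values.
    print(classify_domains({"A": 5.0, "B": 2.0, "C": 3.0}, threshold=3.0))
```

Crossover regions would then be chosen only from residues belonging to the second list.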

105. A method of claim 102, wherein the selection of crossover regions is restricted 
according to one or more recombination constraints. 

106. A method of claim 104, wherein the selection of crossover regions is restricted 
according to one or more recombination constraints. 

107. A method of claim 105, wherein the constraint comprises at least one of a 
requirement of sequence identity between parents, a constraint on the number of 
crossovers, and a constraint on the location of crossovers. 

108. A method of claim 106, wherein the constraint comprises at least one of a 
requirement of sequence identity between parents, a constraint on the number of 
crossovers, and a constraint on the location of crossovers. 



109. A method of claim 105, wherein the constraint comprises a requirement of sequence
identity between parents, and the method further comprises: 

obtaining sequence information for the parent polymers; 

aligning the obtained sequence information; and 

identifying cut points within aligned regions of the parent sequences. 

110. A method of claim 109, where the step of identifying cut points comprises selecting
cut points having a relatively low crossover disruption, and the step of specifying a set 
of parental fragments for recombination based on selected cut points. 
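
One way to picture claims 109 and 110 is the sketch below. It assumes the two parent sequences are already aligned to equal length, that "sequence identity" means an exact match over a fixed window starting at the cut point, and that a per-position disruption value is available. The names `identical_windows` and `pick_cut_points`, the window size, and the example data are all hypothetical.

```python
from typing import List, Sequence


def identical_windows(seq_a: str, seq_b: str, window: int) -> List[int]:
    """Return start positions where both aligned parents match over `window` residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("parents must be aligned to the same length")
    return [
        i
        for i in range(len(seq_a) - window + 1)
        if seq_a[i : i + window] == seq_b[i : i + window]
    ]


def pick_cut_points(
    seq_a: str,
    seq_b: str,
    disruption: Sequence[float],
    window: int,
    max_cuts: int,
) -> List[int]:
    """Choose up to `max_cuts` cut points with the lowest disruption among identical windows."""
    candidates = identical_windows(seq_a, seq_b, window)
    # Rank by disruption, breaking ties by position so the result is deterministic.
    ranked = sorted(candidates, key=lambda i: (disruption[i], i))
    return sorted(ranked[:max_cuts])


if __name__ == "__main__":
    # Hypothetical aligned parents and disruption values.
    cuts = pick_cut_points(
        "ACGTACGT", "ACGAACGT", [3.0, 1.0, 1.0, 1.0, 0.5, 2.0, 0.0, 0.0], 3, 2
    )
    print(cuts)
```

The selected cut points then define the set of parental fragments available for recombination.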

111. A method of claim 44, wherein parental polymers are genes, and the polymers are
physically recombined by a staggered extension process (StEP) comprising the steps of: 

specifying one or more selected crossover locations; 

cutting each of two or more parent polymers within one or more crossover regions
that each encompass one or more specified crossover locations to define a set of polymer 
fragments; 

producing a set of defined polymer fragments, wherein each fragment has an end 
primer comprising a sequence with residues that extend past a specified crossover 
location; and 

assembling at least one of pair of fragments having sequences which overlap an 
end primer of at least one fragment of the pair, to produce a recombinant polymer. 

112. A method of claim 111, wherein the producing step comprises synthesizing two or
more fragments. 

113. A method of claim 112, wherein synthesizing fragments comprises split pool 
synthesis. 

114. A method of claim 111, wherein fragments are assembled by extension from an end
primer.

115. A method of claim 111, wherein the set of defined polymer fragments comprises
all of the fragments arising from cutting all of the parent polymers within all of the 
crossover regions that encompass all of the specified crossover locations. 

116. A method of claim 115, wherein all of the fragments are assembled in all of the 
possible combinations. 
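
The exhaustive assembly of claims 115 and 116 can be modeled in silico as a Cartesian product over fragment positions. The sketch below assumes that all parents are aligned to the same length and cut at the same positions; the helper names `cut` and `all_combinations` and the toy sequences are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def cut(seq: str, cut_points: Sequence[int]) -> List[str]:
    """Split `seq` at the given 0-based cut points into consecutive fragments."""
    bounds = [0, *sorted(cut_points), len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def all_combinations(parents: Sequence[str], cut_points: Sequence[int]) -> List[str]:
    """Assemble every combination of parental fragments, one fragment per position."""
    fragment_sets = [cut(p, cut_points) for p in parents]
    # Group fragments by position: (position 0 from each parent), (position 1 ...), ...
    per_position = zip(*fragment_sets)
    return ["".join(choice) for choice in product(*per_position)]


if __name__ == "__main__":
    # Toy parents, for illustration only.
    print(all_combinations(["AAAA", "BBBB"], [2]))
```

With p parents and k cut points this yields p^(k+1) sequences, including the unrecombined parents themselves.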

117. A method of claim 111, further comprising the step of screening one or more 
recombinant polymers for a property. 



118. A method of claim 74, wherein parental polymers are genes, and the polymers are
physically recombined by a staggered extension process (StEP) comprising the steps of:

specifying one or more selected crossover locations; 

cutting each of two or more parent polymers within one or more crossover regions
that each encompass one or more specified crossover locations to define a set of polymer
fragments;

producing a set of defined polymer fragments, wherein each fragment has an end 
primer comprising a sequence with residues that extend past a specified crossover 
location; 

assembling at least one of pair of fragments having sequences which overlap an
end primer of at least one fragment of the pair, to produce a recombinant polymer.

119. A method of claim 118, wherein fragments are assembled by extension from an end
primer. 



120. A method of claim 111, wherein the set of defined polymer fragments comprises
all of the fragments arising from cutting all of the parent polymers within all of the 
crossover regions that encompass all of the specified crossover locations; and wherein 
all of the fragments are assembled in all of the possible combinations. 



121. A method of claim 44, wherein the parental polymers are genes, and the polymers
are physically recombined by an in vitro-in vivo recombination method comprising the 
steps of: 

shuffling at least two parent polymers to produce a set of parental fragments 
having selected crossover locations; 

assembling fragments at crossover locations by overlap extension and gap repair, 
to provide double stranded sequences containing mismatched regions; and 

repairing the mismatched regions in vivo by inserting the double-stranded 
sequences into a host cell to provide a library of crossover recombinants. 

122. A method of claim 121, wherein the double stranded sequences are inserted into a 
host cell in the form of a heteroduplex plasmid. 

123. A method of claim 121, wherein parental homoduplexes are removed. 

124. A method of claim 44, wherein the parental polymers are genes, and the polymers 
are physically recombined by an in vitro-in vivo recombination method comprising the 
steps of: 

specifying one or more selected cut points; 

preparing synthetic polymer fragments having sequences corresponding to the
sequences of parent polymers that are cut at specified cut points; 

extending the sequence of each fragment at a cut point against a parental template
to produce a set of polymer duplexes representing different combinations of fragments; 

removing parent homoduplex polymers; and 

providing a set of recombinants from the resulting heteroduplex polymers. 

125. A method of claim 124, wherein parent homoduplexes are removed by inserting the 
polymer duplexes into a host cell. 

126. A method of claim 125, wherein the polymer duplexes are inserted into a host cell
in the form of a heteroduplex plasmid.

127. A method of claim 74, wherein the parental polymers are genes, and the polymers 
are physically recombined by an in vitro-in vivo recombination method comprising the 
steps of: 

specifying one or more selected cut points; 

providing polymer fragments having sequences corresponding to the sequences 
of parent polymers that are cut at specified cut points; 

extending the sequence of each fragment at a cut point against a parental template
to produce a set of polymer duplexes representing different combinations of fragments; 

removing parent homoduplex polymers; and 

providing a set of recombinants from the resulting heteroduplex polymers. 

128. A method of claim 127, wherein parent homoduplexes are removed by inserting the 
polymer duplexes into a host cell. 

129. A method of claim 128, wherein the polymer duplexes are inserted into a host cell 
in the form of a heteroduplex plasmid. 

130. A method of claim 127, further comprising the step of screening one or more
recombinant polymers for a property. 

131. A method of claim 44, wherein the parental polymers are genes, and the polymers 
are physically recombined by a PCR amplification method comprising the steps of: 

specifying one or more selected cut points; 

defining polymer fragments having sequences corresponding to the sequences of 
parent polymers that are cut at specified cut points; 

providing sets of primers, wherein each primer in a set hybridizes to all parent 
strands at a crossover region corresponding to a specified cut point; 

producing a set of defined fragments from each parent polymer by PCR
amplification with each set of primers; and

assembling fragments in a pool by PCR amplification. 

132. A method of claim 131, wherein: 

each set of primers is a pair of terminal primers or a pair of intervening primers;

each primer in a terminal pair of primers corresponds to at least one terminal end
of one parent polymer; and

each primer in each intervening pair of primers corresponds to a specified cut
point.

133. A method of claim 132, wherein PCR amplification is performed using a first 
primer selected from a first pair of primers, and a second primer selected from a second pair
of primers. 

134. A method of claim 133, wherein the first and second primers flank the ends of a 
polymer fragment. 

135. A method of claim 74, wherein the parental polymers are genes, and the polymers 
are physically recombined by a PCR amplification method comprising the steps of: 

specifying one or more selected cut points; 

defining polymer fragments having sequences corresponding to the sequences of 
parent polymers that are cut at specified cut points; 

providing sets of primers, wherein each primer in a set hybridizes to all parent 
strands at a crossover region corresponding to a specified cut point; 

producing a set of defined fragments from each parent polymer by PCR 
amplification with each set of primers; and 

assembling fragments in a pool by PCR amplification. 

136. A method of claim 135, wherein: 

each set of primers is a pair of terminal primers or a pair of intervening primers;

each primer in a terminal pair of primers corresponds to at least one terminal end
of one parent polymer; and

each primer in each intervening pair of primers corresponds to a specified cut
point.

137. A method of claim 136, wherein PCR amplification is performed using a first 
primer selected from a first pair of primers, and a second primer selected from a second pair
of primers. 

138. A method of claim 137, wherein the first and second primers flank the ends of a 
polymer fragment. 

139. A method of claim 131, further comprising the step of screening one or more 
recombinant polymers for a property.

140. A method of claim 44, wherein the parental polymers are genes, and the polymers 
are physically recombined by a family shuffling method comprising the steps of: 

specifying one or more selected crossover locations; 

providing sets of primer pairs, wherein each primer of each pair comprises 
sequences from two parent polymers which span and include a specified crossover
location; 

producing fragments of the parent polymers; 

reassembling the fragments in the presence of the primers using PCR 
amplification. 

141. A method of claim 74, wherein the parental polymers are genes, and the polymers 
are physically recombined by a family shuffling method comprising the steps of: 

specifying one or more selected crossover locations; 

providing sets of primer pairs, wherein each primer of each pair comprises 
sequences from two parent polymers which span and include a specified crossover
location;

producing fragments of the parent polymers; 

reassembling the fragments in the presence of the primers using PCR 
amplification. 

142. A method of producing recombinant oligonucleotides from two or more parent 
oligonucleotides by a staggered extension process comprising the steps of: 

selecting one or more crossover locations for each parent oligonucleotide; 

cutting each of two or more parents within one or more crossover regions that
each encompass one or more specified crossover locations to define a set of fragments;

producing a set of defined fragments, wherein each fragment has an end primer 
comprising a sequence with residues that extend past a specified crossover location; and 

assembling at least one of pair of fragments having sequences which overlap an 
end primer of at least one fragment of the pair, to produce a recombinant oligonucleotide.

143. A method of producing recombinant oligonucleotides from two or more parent 
oligonucleotides by an in vitro-in vivo recombination method comprising the steps of: 

selecting one or more crossover locations for each parent oligonucleotide; 
shuffling at least two parent oligonucleotides to produce a set of fragments having
selected crossover locations;

assembling fragments at crossover locations by overlap extension and gap repair, 
to provide double stranded sequences containing mismatched regions; and 

repairing the mismatched regions in vivo by inserting the double-stranded 
sequences into a host cell to provide a library of crossover recombinants. 

144. A method of producing recombinant oligonucleotides from two or more parent 
oligonucleotides by an in vitro-in vivo recombination method comprising the steps of: 

specifying one or more selected cut points for each parent oligonucleotide; 
preparing synthetic polymer fragments having sequences corresponding to the
sequences of parent oligonucleotides that are cut at specified cut points;



extending the sequence of each fragment at a cut point against a parental template
to produce a set of oligonucleotide duplexes representing different combinations of 
fragments; 

removing parent homoduplex oligonucleotides; and 

providing a set of recombinants from the resulting heteroduplex oligonucleotides.

145. A method of claim 144, wherein the oligonucleotide duplexes are removed by 
inserting oligonucleotide duplexes into a host cell.

146. A method of producing recombinant oligonucleotides from two or more parent 
oligonucleotides by a PCR amplification method comprising the steps of: 

specifying one or more selected cut points for each parent oligonucleotide; 

defining oligonucleotide fragments having sequences corresponding to the 
sequences of parent oligonucleotides that are cut at specified cut points; 

providing sets of primers, wherein each primer in a set hybridizes to all parent 
strands at a crossover region corresponding to a specified cut point; 

producing a set of defined fragments from each parent by PCR amplification with
each set of primers; and

assembling fragments in a pool by PCR amplification. 

147. A method of claim 146, wherein: 

each set of primers is a pair of terminal primers or a pair of intervening primers;

each primer in a terminal pair of primers corresponds to at least one terminal end
of one parent polymer; and

each primer in each intervening pair of primers corresponds to a specified cut
point.

148. A method of claim 147, wherein PCR amplification is performed using a first 
primer selected from a first pair of primers, and a second primer selected from a second pair
of primers.

149. A method of claim 148, wherein first and second primers flank the ends of a
fragment.

150. A method of producing recombinant oligonucleotides from two or more parent 
oligonucleotides by a family shuffling method comprising the steps of: 

specifying one or more selected crossover locations for each parent 
oligonucleotide; 

providing sets of primer pairs, wherein each primer of each pair comprises 
sequences from two parents which span and include a specified crossover location; 
producing fragments of the parent polymers; 

reassembling the fragments in the presence of the primers using PCR 
amplification. 

151. A method of claim 1, wherein a coupling interaction between a pair of residues 
in the first polymer sequence is disrupted in a crossover mutant if the identity of a residue
is different in the crossover mutant than in the first polymer sequence, and wherein a
coupling interaction between a pair of residues is scaled by the probabilities that the
identity and sequence position of the coupled residues are the same in both parents. 
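
The first condition of claim 151 might be sketched as a simple count. The sketch assumes the coupled residue pairs are supplied as 0-based index pairs (for example, derived from structural contacts) and that the mutant and the first polymer sequence have equal length. It does not model the probability scaling recited in the second part of the claim. The name `disrupted_couplings` and the example data are hypothetical.

```python
from typing import Iterable, Tuple


def disrupted_couplings(
    first: str, mutant: str, coupled_pairs: Iterable[Tuple[int, int]]
) -> int:
    """Count coupled pairs in which at least one residue differs in the mutant."""
    if len(first) != len(mutant):
        raise ValueError("sequences must have the same length")
    # changed[i] is True where the mutant residue differs from the first sequence.
    changed = [a != b for a, b in zip(first, mutant)]
    return sum(1 for i, j in coupled_pairs if changed[i] or changed[j])


if __name__ == "__main__":
    # Hypothetical sequences and coupled pairs.
    print(disrupted_couplings("ACDEF", "ACDKF", [(0, 3), (1, 2), (3, 4)]))
```

A weighted variant could multiply each pair's contribution by a per-pair factor before summing; the form of that factor is left open here.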

152. A method for producing hybrid polymers from two or more parent polymers
comprising the steps of:

providing at least two parent polymers, both comprising a polypeptide or a
polynucleotide;

identifying structural domains of at least one parent polymer;

organizing identified domains into schema;

calculating a schema disruption profile;

selecting at least one crossover location based on the schema disruption profile;
and

recombining two or more parent polymers at one or more selected crossover
locations to produce at least one library comprising at least one hybrid polymer.

153. A method of claim 152, wherein parent polymers are recombined in silico, in
vitro, in vivo, or in any combination thereof. 

154. A method of claim 153, wherein parent polymers are recombined in a directed 
evolution experiment. 

155. A method of claim 152, wherein crossover locations are selected from a schema 
disruption profile based on a predicted threshold at which the structural tolerance of at 
least one parent polymer is lost. 

156. A method of claim 152, further comprising the step of screening the library for one
or more polymer properties. 

157. A method of claim 152, wherein parent polymers

are recombined in silico to produce at least one candidate hybrid polymer; and

are physically recombined at one or more crossover locations, including at least
one selected crossover location, to produce at least one hybrid polymer corresponding to
a candidate hybrid polymer.

158. A method of claim 157, wherein a library of candidate hybrid polymers is compared
with a library of physically recombined hybrid polymers. 

159. A method of claim 154, wherein the directed evolution experiment includes at least
one protocol selected from the group consisting of fragmentation and reassembly, family
shuffling, exon shuffling, StEP, ITCHY, synthesis techniques, and PCR-based
techniques. 

160. A method of claim 152, wherein crossover locations are selected based on a schema 
disruption threshold.

161. A method of claim 160, wherein each hybrid in the library comprises parent
polymers that are recombined at a single crossover location. 

162. A method for producing hybrid polymers from two or more parent polymers
comprising the steps of:

providing at least two parent polymers, both comprising a polypeptide or a
polynucleotide;

identifying structural domains of at least one parent polymer;

organizing identified domains into schema;

calculating a schema disruption profile based on a disruption threshold;

selecting at least one crossover location based on the schema disruption profile;

recombining two or more parent polymers in silico at one or more selected
crossover locations to produce a hybrid polymer library; and

predicting one or more properties of hybrid polymers in the library.

163. A method of claim 162, wherein each hybrid in the library comprises parent 
polymers that are recombined at a single crossover location. 

164. A method of claim 162, further comprising the step of physically recombining
parent polymers at one or more crossover locations, including at least one selected
crossover location, to produce at least one polymer corresponding to a hybrid polymer
in the library.

165. A method of claim 162, wherein each selected crossover location corresponds to
a schema disruption that is below the threshold.

166. A method of claim 152, wherein the schema disruption profile comprises counting
the interactions between schema to determine groups of schema that are preserved in the 
hybrid polymers. 

167. A method of claim 162, wherein the schema disruption profile comprises counting
the interactions between schema to determine groups of schema that are preserved in the 
hybrid polymers. 

168. A method of claim 166, wherein the polymer is an enzyme and the property of
interest is enzymatic activity.

169. A method of claim 167, wherein the polymer is an enzyme and the property of 
interest is enzymatic activity. 

170. A method of claim 152, wherein the schema disruption profile comprises counting 
the interactions between schema to determine groups of schema that are preserved in the 
hybrid polymers. 

171. A method of claim 152, wherein the enzyme is beta-lactamase.

172. A beta-lactamase hybrid comprising the amino acid sequence of PSE-4, substituted 
in part by an amino acid sequence of TEM-1, wherein the substitution is selected from 
the group of: 

amino acid residues 164-179 of PSE-4 are replaced by the corresponding amino
acid residues of TEM-1;

amino acid residues 190-216 of PSE-4 are replaced by the corresponding amino
acid residues of TEM-1;

amino acid residues 71-216 of PSE-4 are replaced by the corresponding amino
acid residues of TEM-1;

amino acid residues 71-130 of PSE-4 are replaced by the corresponding amino
acid residues of TEM-1; and

amino acid residues 254 and higher of PSE-4 are replaced by the corresponding
amino acids of TEM-1.

173. A beta-lactamase hybrid comprising the amino acid sequence of TEM-1, substituted
in part by an amino acid sequence of PSE-4, wherein the substitution is selected from the 
group of: 

amino acid residues 164-179 of TEM-1 are replaced by the corresponding amino
acid residues of PSE-4; 

amino acid residues 190-216 of TEM-1 are replaced by the corresponding amino 
acid residues of PSE-4; 

amino acid residues 71-216 of TEM-1 are replaced by the corresponding amino 
acid residues of PSE-4; 

amino acid residues 71-130 of TEM-1 are replaced by the corresponding amino 
acid residues of PSE-4; and 

amino acid residues 254 and higher of TEM-1 are replaced by the corresponding 
amino acids of PSE-4. 
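
The block substitutions recited in claims 172 and 173 can be modeled in silico as a range replacement between two aligned sequences. The sketch below assumes gap-free, equal-length aligned strings and a residue numbering that maps directly onto string positions through an offset (`first_number`); the residue numbers used in the claims may follow a numbering scheme that does not map one-to-one onto string indices, in which case a lookup table would be needed instead. The function name `substitute_range` and the toy sequences are hypothetical.

```python
from typing import Optional


def substitute_range(
    host: str,
    donor: str,
    start: int,
    end: Optional[int] = None,
    first_number: int = 1,
) -> str:
    """Replace host residues numbered start..end (inclusive) with donor residues.

    If `end` is None, everything from `start` to the end of the sequence is
    replaced (the "and higher" case). `first_number` is the residue number
    assigned to the first character of both strings.
    """
    if len(host) != len(donor):
        raise ValueError("host and donor must be aligned to the same length")
    lo = start - first_number
    hi = len(host) if end is None else end - first_number + 1
    if lo < 0 or hi > len(host) or lo >= hi:
        raise ValueError("residue range is outside the sequence")
    return host[:lo] + donor[lo:hi] + host[hi:]


if __name__ == "__main__":
    # Toy sequences, not TEM-1 or PSE-4.
    print(substitute_range("AAAAAAAAAA", "BBBBBBBBBB", 3, 5))
    print(substitute_range("AAAAAAAAAA", "BBBBBBBBBB", 8))
```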

174. A hybrid polymer comprising a first polypeptide recombined with at least a second
polypeptide at one or more crossover locations selected according to a schema disruption 
threshold. 

175. A hybrid polymer of claim 174, wherein the threshold is based on counting the 
number of interactions between schema in a schema disruption profile.

Determining the Schema Disruption Profile for a Structure

Experimental Data: [table (B); cell layout not recoverable from the extracted text. Column labels: wt, wt-insert, 1, 2. Quantities: Tm (°C), t1/2. Values in extraction order: 52, 49.5, 12.1, 53, 55.2, 53.3, 2586, n.d., 44.5, 54.3, 52.5, 87.5, 138, 4, 308]

Calculations:

        All schema            Fragments             Z-score
        av        stdev       1         2           1         2
Ec      19.260    4.090       10.770    8.124       -2.076    -2.723
Ec*     0.006     0.002       0.014     0.005       4.838     -0.857
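
The Z-score columns in the Ec row are numerically consistent with the usual definition, (fragment value - average over all schema) / standard deviation. The sketch below rests on that assumption; the function name `z_score` is hypothetical. The Ec* row cannot be reproduced to three decimals from the tabulated inputs, presumably because they are rounded, so it is not checked here.

```python
def z_score(value: float, mean: float, stdev: float) -> float:
    """Standard score of `value` relative to a distribution with the given mean and stdev."""
    if stdev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / stdev


if __name__ == "__main__":
    # Ec row of the table: av = 19.260, stdev = 4.090.
    for fragment_value in (10.770, 8.124):
        print(round(z_score(fragment_value, 19.260, 4.090), 3))
```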
SEQUENCE LISTING

<110> California Institute of Technology 
Wang, Zhen-Gang 
Voigt, Christopher A. 
Mayo, Stephen L. 
Arnold, Frances H. 

<120> GENE RECOMBINATION AND HYBRID PROTEIN DEVELOPMENT 

<130> 9373/2H812-W03 

<150> US 10/016,668 
<151> 2001-10-26 

<160> 6 

<170> PatentIn version 3.1

<210> 1
<211> 361
<212> PRT
<213> Enterobacter cloacae

<300>
<308> SWISS-PROT / P05364
<309> 1988-11-09
<313> (1)..(361)

<400> 1

Thr Pro Val Ser Glu Lys Gln Leu Ala Gln Val Val Ala Asn Thr Ile
1 5 10 15

Thr Pro Leu Met Lys Ala Gln Ser Val Pro Gly Met Ala Val Ala Val
20 25 30

Ile Tyr Gln Gly Lys Pro His Tyr Tyr Thr Phe Gly Lys Ala Asp Ile
35 40 45

Ala Ala Asn Lys Pro Val Thr Pro Gln Thr Leu Phe Glu Leu Gly Ser
50 55 60

Ile Ser Lys Thr Phe Thr Gly Val Leu Gly Gly Asp Ala Ile Ala Arg
65 70 75 80

Gly Glu Ile Ser Leu Asp Asp Ala Val Thr Arg Tyr Trp Pro Gln Leu
85 90 95

Thr Gly Lys Gln Trp Gln Gly Ile Arg Met Leu Asp Leu Ala Thr Tyr
100 105 110

Thr Ala Gly Gly Leu Pro Leu Gln Val Pro Asp Glu Val Thr Asp Asn
115 120 125

Ala Ser Leu Leu Arg Phe Tyr Gln Asn Trp Gln Pro Gln Trp Lys Pro
130 135 140

Gly Thr Thr Arg Leu Tyr Ala Asn Ala Ser Ile Gly Leu Phe Gly Ala
145 150 155 160

Leu Ala Val Lys Pro Ser Gly Met Pro Tyr Glu Gln Ala Met Thr Thr
165 170 175

Arg Val Leu Lys Pro Leu Lys Leu Asp His Thr Trp Ile Asn Val Pro
180 185 190

Lys Ala Glu Glu Ala His Tyr Ala Trp Gly Tyr Arg Asp Gly Lys Ala
195 200 205

Val Arg Val Ser Pro Gly Met Leu Asp Ala Gln Ala Tyr Gly Val Lys
210 215 220

Thr Asn Val Gln Asp Met Ala Asn Trp Val Met Ala Asn Met Ala Pro
225 230 235 240

Glu Asn Val Ala Asp Ala Ser Leu Lys Gln Gly Ile Ala Leu Ala Gln
245 250 255

Ser Arg Tyr Trp Arg Ile Gly Ser Met Tyr Gln Gly Leu Gly Trp Glu
260 265 270

Met Leu Asn Trp Pro Val Glu Ala Asn Thr Val Val Glu Gly Ser Asp
275 280 285

Ser Lys Val Ala Leu Ala Pro Leu Pro Val Ala Glu Val Asn Pro Pro
290 295 300

Ala Pro Pro Val Lys Ala Ser Trp Val His Lys Thr Gly Ser Thr Gly
305 310 315 320

Gly Phe Gly Ser Tyr Val Ala Phe Ile Pro Glu Lys Gln Ile Gly Ile
325 330 335

Val Met Leu Ala Asn Thr Ser Tyr Pro Asn Pro Ala Arg Val Glu Ala
340 345 350

Ala Tyr His Ile Leu Glu Ala Leu Gln
355 360

<210> 2
<211> 361
<212> PRT
<213> Citrobacter freundii

<300>
<308> SWISS-PROT / P05193
<309> 1987-08-05
<313> (1)..(361)

<400> 2

Ala Ala Lys Thr Glu Gln Gln Ile Ala Asp Ile Val Asn Arg Thr Ile
1 5 10 15

Thr Pro Leu Met Gln Glu Gln Ala Ile Pro Gly Met Ala Val Ala Ile
20 25 30

Ile Tyr Glu Gly Lys Pro Tyr Tyr Phe Thr Trp Gly Lys Ala Asp Ile
35 40 45

Ala Asn Asn His Pro Val Thr Gln Gln Thr Leu Phe Glu Leu Gly Ser
50 55 60

Val Ser Lys Thr Phe Asn Gly Val Leu Gly Gly Asp Arg Ile Ala Arg
65 70 75 80

Gly Glu Ile Lys Leu Ser Asp Pro Val Thr Lys Tyr Trp Pro Glu Leu
85 90 95

Thr Gly Lys Gln Trp Arg Gly Ile Ser Leu Leu His Leu Ala Thr Tyr
100 105 110

Thr Ala Gly Gly Leu Pro Leu Gln Ile Pro Gly Asp Val Thr Asp Lys
115 120 125

Ala Glu Leu Leu Arg Phe Tyr Gln Asn Trp Gln Pro Gln Trp Thr Pro
130 135 140

Gly Ala Lys Arg Leu Tyr Ala Asn Ser Ser Ile Gly Leu Phe Gly Ala
145 150 155 160

Leu Ala Val Lys Ser Ser Gly Met Ser Tyr Glu Glu Ala Met Thr Arg
165 170 175

Arg Val Leu Gln Pro Leu Lys Leu Ala His Thr Trp Ile Thr Val Pro
180 185 190

Gln Ser Glu Gln Lys Asn Tyr Ala Trp Gly Tyr Leu Glu Gly Lys Pro
195 200 205

Val His Val Ser Pro Gly Gln Leu Asp Ala Glu Ala Tyr Gly Val Lys
210 215 220

Ser Ser Val Ile Asp Met Ala Arg Trp Val Gln Ala Asn Met Asp Ala
225 230 235 240

Ser His Val Gln Glu Lys Thr Leu Gln Gln Gly Ile Glu Leu Ala Gln
245 250 255

Ser Arg Tyr Trp Arg Ile Gly Asp Met Tyr Gln Gly Leu Gly Trp Glu
260 265 270

Met Leu Asn Trp Pro Leu Lys Ala Asp Ser Ile Ile Asn Gly Ser Asp
275 280 285

Ser Lys Val Ala Leu Ala Ala Leu Pro Ala Val Glu Val Asn Pro Pro
290 295 300

Ala Pro Ala Val Lys Ala Ser Trp Val His Lys Thr Gly Ser Thr Gly
305 310 315 320

Gly Phe Gly Ser Tyr Val Ala Phe Val Pro Glu Lys Asn Leu Gly Ile
325 330 335

Val Met Leu Ala Asn Lys Ser Tyr Pro Asn Pro Ala Arg Val Glu Ala
340 345 350

Ala Trp Arg Ile Leu Glu Lys Leu Gln
355 360



<210> 


3 


<211> 


361 


<212> 


PRT 


<213> 


Yersinia enterocolitica 


<300> 




<308> 


SWIS-PROT / P45460 


<309> 


1995-11-01 


<313> 


(1)..(361) 


<400> 


3 



Thr Lys Leu Thr Glu Leu Gin Val Ala Thr He Val Asn Asn Thr Leu 
15 10 15 



Thr Pro Leu Leu Glu Lys Gin Gly lie Pro Gly Met Ala Val Ala Val 
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20 
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25 
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30 



Phe Tyr Asp Gly Lys Pro Gin Phe Phe Asn Tyr Gly Met Ala Asp lie 
35 40 45 



Lys Ala Gly Arg Pro Val Thr Glu Asn Thr Leu Phe Glu Leu Gly Ser 
50 55 60 



Val Ser Lys Thr Phe Thr Gly Val Ala Gly Glu Tyr Ala Met Gin Thr 
65 70 75 80 



Gly lie Met Asn Leu Asn Asp Pro Val Thr Glu Tyr Ala Pro Glu Leu 
85 90 95 



Thr Gly Ser Gin Trp Lys Asp Val Lys Met Leu His Leu Ala Thr Tyr 
100 105 110 



Thr Ala Gly Gly Leu Pro Leu Gin Leu Pro Asp Ser Val Thr Asp Gin 
115 120 125 



Lys Ser Leu Trp Gln Tyr Tyr Gln Gln Trp Gln Pro Gln Trp Ala Pro
130 135 140 



Gly Val Met Arg Asn Tyr Ser Asn Ala Ser Ile Gly Leu Phe Gly Ala
145 150 155 160 



Leu Ala Val Lys Arg Ser Gln Leu Thr Phe Glu Asn Tyr Met Lys Glu
165 170 175 



Tyr Val Phe Gln Pro Leu Lys Leu Asp His Thr Phe Ile Thr Ile Pro
180 185 190 



Glu Ser Met Gln Ser Asn Tyr Ala Trp Gly Tyr Lys Asp Gly Gln Pro
195 200 205 



Val Arg Val Thr Leu Gly Met Leu Gly Glu Glu Ala Tyr Gly Val Lys 
210 215 220 



Ser Thr Ser Gln Asp Met Val Arg Phe Met Gln Ala Asn Met Asp Pro
225 230 235 240 



Glu Ser Leu Gly Asn Asp Lys Leu Lys Glu Ala Ile Ile Ala Ser Gln
245 250 255 



Ser Arg Tyr Phe Gln Ala Gly Asp Met Phe Gln Gly Leu Gly Trp Glu
260 265 270

Met Tyr Ser Trp Pro Ile Asn Pro Gln Gly Val Ile Ala Asp Ser Gly
275 280 285 



Asn Asp Ile Ala Leu Lys Pro Arg Lys Val Glu Ala Leu Val Pro Ala
290 295 300 



Gln Pro Ala Val Arg Ala Ser Trp Val His Lys Thr Gly Ala Thr Asn
305 310 315 320 



Gly Phe Gly Ala Tyr Ile Val Phe Ile Pro Glu Glu Lys Val Gly Ile
325 330 335 



Val Met Leu Ala Asn Lys Asn Tyr Pro Asn Pro Val Arg Val Gln Ala
340 345 350 



Ala Tyr Asp Ile Leu Gln Ala Leu Arg
355 360


<210> 4
<211> 359
<212> PRT
<213> Klebsiella pneumoniae

<300>
<308> SWISS-PROT / Q48437
<309> 1996-11-01
<313> (1)..(359)

<400> 4



Tyr Ala Arg Gly Glu Ala Pro Leu Thr Ala Ala Val Asp Gly Ile Ile
1 5 10 15 



Gln Pro Met Leu Lys Glu Tyr Arg Ile Pro Gly Met Ala Val Ala Val
20 25 30 



Leu Lys Asp Gly Lys Ala His Tyr Phe Asn Tyr Gly Val Ala Asn Arg 
35 40 45 



Glu Ser Gly Gln Arg Val Ser Glu Gln Thr Leu Phe Glu Ile Gly Ser
50 55 60 



Val Ser Lys Thr Leu Thr Ala Thr Leu Gly Ala Tyr Ala Ala Val Lys 
65 70 75 80 



Gly Gly Phe Glu Leu Asp Asp Lys Val Ser Gln His Ala Pro Trp Leu
85 90 95 



Lys Gly Ser Ala Phe Asp Gly Val Thr Met Ala Glu Leu Ala Thr Tyr
100 105 110 



Ser Ala Gly Gly Leu Pro Leu Gln Phe Pro Asp Glu Val Asp Ser Asn
115 120 125 



Asp Lys Met Arg Thr Tyr Tyr Arg His Trp Ser Pro Val Tyr Pro Ala 
130 135 140 



Gly Thr His Arg Gln Tyr Ser Asn Pro Ser Ile Gly Leu Phe Gly His
145 150 155 160 



Leu Ala Ala Asn Ser Leu Gly Gln Pro Phe Glu Gln Leu Met Ser Gln
165 170 175 



Thr Leu Leu Pro Lys Leu Gly Leu His His Thr Tyr Ile Gln Val Pro
180 185 190 



Glu Ser Ala Ile Ala Asn Tyr Ala Tyr Gly Tyr Lys Glu Asp Lys Pro
195 200 205 



Val Arg Val Thr Pro Gly Val Leu Ala Ala Glu Ala Tyr Gly Ile Lys
210 215 220 



Thr Gly Ser Ala Asp Leu Leu Lys Phe Thr Glu Ala Asn Met Gly Tyr 
225 230 235 240 



Gln Gly Asp Ala Ala Leu Lys Thr Arg Ile Ala Leu Thr His Thr Gly

245 250 255 



Phe Tyr Ser Val Gly Asp Met Thr Gln Gly Leu Gly Trp Glu Ser Tyr
260 265 270 



Ala Tyr Pro Leu Thr Glu Gln Ala Leu Leu Ala Gly Asn Ser Pro Ala
275 280 285 



Val Ser Phe Gln Ala Asn Pro Val Thr Arg Phe Ala Val Pro Lys Ala
290 295 300 



Met Gly Glu Gln Arg Leu Tyr Asn Lys Thr Gly Ser Thr Gly Gly Phe
305 310 315 320 



Gly Ala Tyr Val Ala Phe Val Pro Ala Arg Gly Ile Ala Ile Val Met
325 330 335

Leu Ala Asn Arg Asn Tyr Pro Ile Glu Ala Arg Val Lys Ala Ala His
340 345 350 

Ala Ile Leu Ser Gln Leu Ala
355 



<210> 5
<211> 286
<212> PRT
<213> Escherichia coli

<300>
<308> SWISS-PROT / P00810
<309> 1986-07-01
<313> (1)..(286)

<400> 5





Met Ser Ile Gln His Phe Arg Val Ala Leu Ile Pro Phe Phe Ala Ala
1 5 10 15

Phe Cys Leu Pro Val Phe Ala His Pro Glu Thr Leu Val Lys Val Lys 
20 25 30 

Asp Ala Glu Asp Gln Leu Gly Ala Arg Val Gly Tyr Ile Glu Leu Asp
35 40 45 

Leu Asn Ser Gly Lys Ile Leu Glu Ser Phe Arg Pro Glu Glu Arg Phe
50 55 60 

Pro Met Met Ser Thr Phe Lys Val Leu Leu Cys Gly Ala Val Leu Ser 

65 70 75 80

Arg Val Asp Ala Gly Gln Glu Gln Leu Gly Arg Arg Ile His Tyr Ser
85 90 95 

Gln Asn Asp Leu Val Glu Tyr Ser Pro Val Thr Glu Lys His Leu Thr
100 105 110 

Asp Gly Met Thr Val Arg Glu Leu Cys Ser Ala Ala Ile Thr Met Ser
115 120 125 

Asp Asn Thr Ala Ala Asn Leu Leu Leu Thr Thr Ile Gly Gly Pro Lys
130 135 140 

Glu Leu Thr Ala Phe Leu His Asn Met Gly Asp His Val Thr Arg Leu 
145 150 155 160 



Asp Arg Trp Glu Pro Glu Leu Asn Glu Ala Ile Pro Asn Asp Glu Arg
165 170 175



Asp Thr Thr Met Pro Ala Ala Met Ala Thr Thr Leu Arg Lys Leu Leu 
180 185 190 



Thr Gly Glu Leu Leu Thr Leu Ala Ser Arg Gln Gln Leu Ile Asp Trp
195 200 205 



Met Glu Ala Asp Lys Val Ala Gly Pro Leu Leu Arg Ser Ala Leu Pro 
210 215 220 



Ala Gly Trp Phe Ile Ala Asp Lys Ser Gly Ala Gly Glu Arg Gly Ser
225 230 235 240 



Arg Gly Ile Ile Ala Ala Leu Gly Pro Asp Gly Lys Pro Ser Arg Ile
245 250 255 



Val Val Ile Tyr Thr Thr Gly Ser Gln Ala Thr Met Asp Glu Arg Asn
260 265 270 



Arg Gln Ile Ala Glu Ile Gly Ala Ser Leu Ile Lys His Trp
275 280 285 



<210> 6
<211> 288
<212> PRT
<213> Pseudomonas aeruginosa

<300>
<308> SWISS-PROT / P16897
<309> 1990-08-15
<313> (1)..(288)

<400> 6


Met Lys 


Phe Leu Leu Ala Phe Ser 


1 


5 



10 15 



Ala Ser Ser Ser Lys Phe Gln Gln Val Glu Gln Asp Val Lys Ala Ile
20 25 30 



Glu Val Ser Leu Ser Ala Arg Ile Gly Val Ser Val Leu Asp Thr Gln
35 40 45 



Asn Gly Glu Tyr Trp Asp Tyr Asn Gly Asn Gln Arg Phe Pro Leu Thr
50 55 60 



Ser Thr Phe Lys Thr Ile Ala Cys Ala Lys Leu Leu Tyr Asp Ala Glu
65 70 75 80

Gln Gly Lys Val Asn Pro Asn Ser Thr Val Glu Ile Lys Lys Ala Asp
85 90 95 



Leu Val Thr Tyr Ser Pro Val Ile Glu Lys Gln Val Gly Gln Ala Ile
100 105 110 



Thr Leu Asp Asp Ala Cys Phe Ala Thr Met Thr Thr Ser Asp Asn Thr 
115 120 125 



Ala Ala Asn Ile Ile Leu Ser Ala Val Gly Gly Pro Lys Gly Val Thr
130 135 140 



Asp Phe Leu Arg Gln Ile Gly Asp Lys Glu Thr Arg Leu Asp Arg Ile
145 150 155 160

Glu Pro Asp Leu Asn Glu Gly Lys Leu Gly Asp Leu Arg Asp Thr Thr
165 170 175



Thr Pro Lys Ala Ile Ala Ser Thr Leu Asn Lys Phe Leu Phe Gly Ser
180 185 190 



Ala Leu Ser Glu Met Asn Gln Lys Lys Leu Glu Ser Trp Met Val Asn
195 200 205 



Asn Gln Val Thr Gly Asn Leu Leu Arg Ser Val Leu Pro Ala Gly Trp
210 215 220 



Asn Ile Ala Asp Arg Ser Gly Ala Gly Gly Phe Gly Ala Arg Ser Ile
225 230 235 240 



Thr Ala Val Val Trp Ser Glu His Gln Ala Pro Ile Ile Val Ser Ile
245 250 255 



Tyr Leu Ala Gln Thr Gln Ala Ser Met Glu Glu Arg Asn Asp Ala Ile
260 265 270 



Val Lys Ile Gly His Ser Ile Phe Asp Val Tyr Thr Ser Gln Ser Arg
275 280 285 



